I have the following grammar:
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
NAME_TAG : 'name';
IS_TAG : 'is';
START : 'START';
END : ('END START') => 'END START' ;
WORD : 'A'..'Z'+;
rule : START NAME_TAG IS_TAG WORD END;
and want to parse input like "START name is END END START". The problem here is the END token: the text 'END ' (WORD + SPACE) is misinterpreted. I thought the correct approach would be a syntactic predicate (the END token), but maybe I am wrong.
I'd not create tokens that consist of two (or more) WORDs separated by spaces. Why not tokenize 'END' as an END token and then do something like this:
rule : START NAME_TAG IS_TAG word END START;
word : WORD | END; // expand this rule, as you see fit
NAME_TAG : 'name';
IS_TAG : 'is';
START : 'START';
END : 'END';
WORD : 'A'..'Z'+;
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
which would parse "START name is END END START" into a parse tree in which the first END is matched by the word rule.
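To see the effect of this approach, here is a plain-Java sketch (hand-rolled for illustration, not ANTLR-generated code) of the token stream this lexer produces. The point is that every 'END' becomes a plain END token, and the ambiguity is resolved in the parser by the word rule instead of in the lexer:

```java
import java.util.*;

class SeparateEndTokenSketch {
    // Classifies each whitespace-separated word the way the lexer rules above would.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String w : input.trim().split("\\s+")) {     // SPACE -> hidden channel
            switch (w) {
                case "name":  tokens.add("NAME_TAG"); break;
                case "is":    tokens.add("IS_TAG");   break;
                case "START": tokens.add("START");    break;
                case "END":   tokens.add("END");      break; // plain END token
                default:      tokens.add("WORD");            // 'A'..'Z'+
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The parser rule `word : WORD | END;` then accepts the first END as the word.
        System.out.println(tokenize("START name is END END START"));
        // [START, NAME_TAG, IS_TAG, END, END, START]
    }
}
```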
What you did wrong was not giving the lexer rule a way to recover when the predicate fails. Here's a proper use of a predicate:
rule : START NAME_TAG IS_TAG WORD END;
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
NAME_TAG : 'name';
IS_TAG : 'is';
START : 'START';
WORD : ('END START') => 'END START' {$type = END;} // re-type the combined match as END
     | 'A'..'Z'+                                   // predicate failed: plain WORD
     ;

fragment END : ; // never matched on its own; only declares the END token type
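To make the recovery explicit, here is a plain-Java sketch (hypothetical names, not ANTLR-generated code) of what this lexer rule effectively does: first try the 'END START' lookahead, and when that predicate fails, fall back to consuming uppercase letters as a WORD:

```java
import java.util.*;

class PredicateLexerSketch {
    // Assumes well-formed input: keywords and uppercase words separated by whitespace.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) { i++; continue; } // SPACE -> hidden
            if (input.startsWith("END START", i)) {        // predicate succeeds:
                tokens.add("END");                         // one token, $type = END
                i += "END START".length();
            } else if (input.startsWith("name", i)) {
                tokens.add("NAME_TAG"); i += 4;
            } else if (input.startsWith("is", i)) {
                tokens.add("IS_TAG");   i += 2;
            } else {                                       // predicate failed: recover
                int j = i;
                while (j < input.length()
                        && input.charAt(j) >= 'A' && input.charAt(j) <= 'Z') j++;
                String word = input.substring(i, j);       // 'A'..'Z'+
                tokens.add(word.equals("START") ? "START" : "WORD");
                i = j;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first END is not followed by START, so it falls through to WORD;
        // the second END plus START is consumed as a single END token.
        System.out.println(tokenize("START name is END END START"));
        // [START, NAME_TAG, IS_TAG, WORD, END]
    }
}
```

That token stream is exactly what `rule : START NAME_TAG IS_TAG WORD END;` expects.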