Google Research: Solving ANTLR errors using ANTLRWorks

Solving ANTLR grammar errors can be very difficult, especially in complex grammar files.

Below is a simple example based on the GQL ANTLR grammar used in FxGqlC.
(It is reduced to illustrate the problem. A complete grammar can be found here. GQL is a domain-specific language similar to SQL / T-SQL.)

grammar sql;
select_command
: SELECT (WS top_clause)? WS column_list EOF
;
top_clause
: TOP expression
;
column_list
: expression (WS? ',' WS? expression)*
;
expression
: expression_3
;
expression_3
: expression_2 (WS? op_3 WS? expression_2)*
;
op_3 : '+' | '-' | '&' | '|' | '^'
;
expression_2
: expression_1 (WS? op_2 WS? expression_1)*
;
op_2 : '*' | '/' | '%'
;
expression_1
: op_1 WS? expression_1
| expression_atom
;
op_1 : '~' | '+' | '-'
;
expression_atom
: NUMBER
| '(' WS? expression WS? ')'
;
SELECT : 'select' ;
TOP : 'top' ;
NUMBER : DIGIT+;
WS
: (' '|'\t'|'\n'|'\r'|'\u000C')+
;

fragment DIGIT : '0'..'9';

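The three expression "levels" handle operator precedence: expression_1 covers the unary operators, which bind tightest, expression_2 covers '*', '/' and '%', and expression_3 covers '+', '-', '&', '|' and '^', which bind loosest. (The lexer rules as written match the lowercase keywords 'select' and 'top'; the uppercase examples below assume the input is matched case-insensitively.) The grammar is designed to parse commands such as: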

SELECT 17
SELECT 17 * 14 + 3
SELECT 17 + 14 + 3
SELECT - 17
SELECT 17 * - 14 + 3
SELECT 17 + 14 + - 3
SELECT TOP 3 17
...
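
To make the role of the three levels concrete, here is a minimal Python sketch of a hand-written recursive-descent parser that mirrors the rules expression_3, expression_2, expression_1 and expression_atom. It is not part of FxGqlC or ANTLR. The names tokenize and parse and the tuple-based tree are assumptions made for this illustration, and whitespace is simply dropped instead of being matched as WS.

```python
import re

OP_3 = {"+", "-", "&", "|", "^"}   # binary operators, lowest precedence
OP_2 = {"*", "/", "%"}             # binary operators, higher precedence
OP_1 = {"~", "+", "-"}             # unary prefix operators, highest precedence


def tokenize(text):
    """Return numbers and operator symbols; anything else (e.g. spaces) is skipped."""
    return re.findall(r"\d+|[-+&|^*/%~()]", text)


def parse(text):
    """Parse one expression into nested tuples: (op, left, right) or (op, operand)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression_3():
        # expression_2 (op_3 expression_2)*
        node = expression_2()
        while peek() in OP_3:
            op = take()
            node = (op, node, expression_2())
        return node

    def expression_2():
        # expression_1 (op_2 expression_1)*
        node = expression_1()
        while peek() in OP_2:
            op = take()
            node = (op, node, expression_1())
        return node

    def expression_1():
        # op_1 expression_1 | expression_atom
        if peek() in OP_1:
            op = take()
            return (op, expression_1())
        return expression_atom()

    def expression_atom():
        # NUMBER | '(' expression ')'
        tok = take()
        if tok == "(":
            node = expression_3()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    tree = expression_3()
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()!r}")
    return tree


print(parse("17 * 14 + 3"))
print(parse("17 * - 14 + 3"))
```

Because expression_3 calls expression_2 for its operands, the multiplication in "17 * 14 + 3" ends up nested inside the addition, which is the grouping that operator precedence requires.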

When you try to compile or interpret the grammar in ANTLRWorks, you get this error:

[11:36:44] error(211): <notsaved>:21:43: [fatal] rule expression_3 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[11:36:44] warning(200): <notsaved>:21:43:
Decision can match input such as "WS {'+', '-'} WS NUMBER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

Solving this error just by analyzing the grammar is quite a challenge, even for this very simple example. With a large grammar file it is nearly impossible.
Fortunately, ANTLRWorks has a very useful tool that shows what is going wrong.

The error message indicates that there is a problem with expression_3 (expression_3 is also indicated in red in the list of rules/tokens in the left pane).
Put your cursor in expression_3, and select the tab "Syntax Diagram" in the lower pane.
First, in the lower pane, select "Alternatives '1'" in the upper right corner.
==> In green you see how the grammar matches "WS '+' WS NUMBER", which is exactly what we want.
Next, select "Alternatives '2'" in the upper right corner.
==> In red you see how the grammar matches "WS '+' WS NUMBER".
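In the latter case, you can see that the matching starts in the TOP-clause: the loop in expression_3 is exited, the TOP expression ends, and the same input is matched as the start of the column list.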

This is what's happening: parsing "SELECT TOP 1 + 2 + 20" is ambiguous.
It is not clear where the top-clause ends and the column-list starts, because each '+' sign can be read as either a unary or a binary operator.

It can be: "SELECT [TOP 1] [+ 2 + 20]", being equivalent to "SELECT TOP 1 22"
Or it can be: "SELECT [TOP 1 + 2] [+ 20]", being equivalent to "SELECT TOP 3 20"
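
The two readings can also be found mechanically. The following Python sketch is an illustration only, not how ANTLR analyses the grammar. It tries every position at which the TOP expression could end and keeps the splits where both halves are complete expressions. The function names are hypothetical, only '+', '-', '*' and parentheses are supported, a single column is assumed, and whitespace is dropped rather than matched as WS.

```python
import re


def tokenize(text):
    """Return numbers and the supported symbols; whitespace is skipped."""
    return re.findall(r"\d+|[-+*()]", text)


def atom(toks, i):
    # NUMBER | '(' expression ')'
    if i < len(toks) and toks[i].isdigit():
        return int(toks[i]), i + 1
    if i < len(toks) and toks[i] == "(":
        value, i = additive(toks, i + 1)
        if i < len(toks) and toks[i] == ")":
            return value, i + 1
    raise SyntaxError("expected a number or a parenthesised expression")


def unary(toks, i):
    # ('+' | '-') unary | atom
    if i < len(toks) and toks[i] in ("+", "-"):
        sign = -1 if toks[i] == "-" else 1
        value, i = unary(toks, i + 1)
        return sign * value, i
    return atom(toks, i)


def multiplicative(toks, i):
    value, i = unary(toks, i)
    while i < len(toks) and toks[i] == "*":
        rhs, i = unary(toks, i + 1)
        value *= rhs
    return value, i


def additive(toks, i):
    value, i = multiplicative(toks, i)
    while i < len(toks) and toks[i] in ("+", "-"):
        op = toks[i]
        rhs, i = multiplicative(toks, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i


def full_value(toks):
    """Value of toks if they form exactly one expression, otherwise None."""
    try:
        value, end = additive(toks, 0)
    except SyntaxError:
        return None
    return value if end == len(toks) else None


def interpretations(text):
    """All (top_value, column_value) readings of the text that follows SELECT TOP."""
    toks = tokenize(text)
    readings = []
    for split in range(1, len(toks)):
        top = full_value(toks[:split])
        column = full_value(toks[split:])
        if top is not None and column is not None:
            readings.append((top, column))
    return readings


print(interpretations("1 + 2 + 20"))  # [(1, 22), (3, 20)]
```

An input such as "3 17" yields a single reading, so the problem only appears when a sign follows the first operand of the TOP expression.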

This ambiguity must be resolved, because only one interpretation should be valid.
In this specific case, the grammar can be changed so that the top-clause expression must be enclosed in parentheses whenever it is not a simple number.
This is easily achieved by changing:

top_clause
: TOP expression
;

to:

top_clause
: TOP expression_atom
;

This solves the ambiguity. The text "SELECT TOP 1 + 2 + 20" is now parsed as "SELECT [TOP 1] [+ 2 + 20]".
Anyone who wants to use "1 + 2" in the TOP-clause should write "SELECT TOP (1 + 2) + 20", which is parsed as "SELECT [TOP (1 + 2)] [+ 20]".
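
The effect of the corrected rule can be checked with a small hand-written parser. The Python sketch below is an illustration under stated assumptions, not the code ANTLR generates: parse_select is a hypothetical name, keywords are matched case-insensitively, only '+', '-', '*' and parentheses are supported, and whitespace is dropped rather than matched as WS. The point is that the TOP value is read with the atom rule only, so no lookahead is needed to decide where the TOP-clause ends.

```python
import re


def parse_select(text):
    """Parse 'SELECT [TOP atom] expr (, expr)*' and return (top, columns) as integers."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[-+*(),]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok.lower() != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def atom():
        # NUMBER | '(' expression ')'
        tok = take()
        if tok == "(":
            value = expression()
            take(")")
            return value
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    def unary():
        if peek() in ("+", "-"):
            sign = -1 if take() == "-" else 1
            return sign * unary()
        return atom()

    def term():
        value = unary()
        while peek() == "*":
            take()
            value *= unary()
        return value

    def expression():
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    take("select")
    top = None
    if peek() is not None and peek().lower() == "top":
        take()
        top = atom()  # corrected rule: TOP expression_atom, not TOP expression
    columns = [expression()]
    while peek() == ",":
        take()
        columns.append(expression())
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()!r}")
    return top, columns


print(parse_select("SELECT TOP 1 + 2 + 20"))    # (1, [22])
print(parse_select("SELECT TOP (1 + 2) + 20"))  # (3, [20])
```

Replacing the call to atom() after TOP with expression() would make this parser consume "1 + 2 + 20" greedily and then fail to find a column list, which is one way to see why the original rule was problematic.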

Below you find the complete example, with the TOP-clause corrected:

grammar sql;
select_command
: SELECT (WS top_clause)? WS column_list EOF
;
top_clause
: TOP expression_atom
;
column_list
: expression (WS? ',' WS? expression)*
;

expression
: expression_3
;
expression_3
: expression_2 (WS? op_3 WS? expression_2)*
;
op_3 : '+' | '-' | '&' | '|' | '^'
;
expression_2
: expression_1 (WS? op_2 WS? expression_1)*
;
op_2 : '*' | '/' | '%'
;
expression_1
: op_1 WS? expression_1
| expression_atom
;
op_1 : '~' | '+' | '-'
;
expression_atom
: NUMBER
| '(' WS? expression WS? ')'
;
SELECT : 'select' ;
TOP : 'top' ;
NUMBER : DIGIT+;
WS
: (' '|'\t'|'\n'|'\r'|'\u000C')+
;

fragment DIGIT : '0'..'9';

Monday, 27 August 2012