ANTLR island grammars and a non-greedy rule that consumes too much

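I'm having a problem with an island grammar and a non-greedy rule that is meant to consume "everything except what I want".

Expected result:

My input file is a C header file containing function declarations along with typedefs, structs, comments, and preprocessor definitions. The output I want is the parsing and subsequent transformation of the function declarations only. I want to ignore everything else.

Setup and what I've tried:

The header file I'm trying to lex and parse is very uniform and consistent. Every function declaration is preceded by the linkage macro PK_linkage_m, and all functions return the same type PK_ERROR_code_t, e.g.: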

PK_linkage_m PK_ERROR_code_t PK_function(...);

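These tokens do not appear anywhere other than at the start of a function declaration.

I have approached this as an island grammar, that is, function declarations in a sea of text. I tried to use the linkage token PK_linkage_m to mark the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.

Observed problem:

Lexing and parsing a single function declaration works, but it fails when a file contains more than one function declaration. The token stream shows that "everything + all function declarations + the PK_ERROR_code_t of the last function declaration" is consumed as text, and then only the last function declaration in the file is parsed correctly.

My one-line summary: my non-greedy rule, which should consume everything before PK_ERROR_code_t, is consuming too much.

What I perhaps incorrectly believe is the solution:

Somehow fix my non-greedy lexer rule so that it consumes everything until it finds a PK_linkage_m token. My non-greedy rule appears to consume too much.

What I haven't tried:

As this is my first ANTLR project, and my first language parsing project in a very long time, I'd be more than happy to rewrite it if I'm going about this the wrong way. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, nor how it would be fundamentally different.

Here is my lexer file KernelLexer.g4: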

lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant

@lexer::members {
    public static final int WHITESPACE = 1;
}

PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';

// Doesn't work. Once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;

TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;

mode FUNCTION;

//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';

COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;

ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';

WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);

Here is my parser file KernelParser.g4:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file : func_decl+;

func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;

param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;

Here is a simple example input file:

/*some stuff*/

other stuff;

PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t         /*class*/,/* a class */
/* returned */
PK_CLASS_t *const  /*superclass*/         /* immediate superclass of class */
);

/*some stuff*/
blar blar;


PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t           /*may_be_subclass*/,/* a potential subclass */
PK_CLASS_t           /*class*/,/* a class */
/* returned */
PK_LOGICAL_t *const  /*is_subclass*/      /* whether it was a subclass */
);


more stuff;

Here is the token output:

line 28:0 token recognition error at: 'more stuff;\r\n'
[@0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[@1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[@2,350:350='(',<'('>,19:0]
[@3,369:378='PK_CLASS_t',<ID>,21:0]
[@4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[@5,409:409=',',<','>,21:40]
[@6,439:448='PK_CLASS_t',<ID>,22:0]
[@7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[@8,469:469=',',<','>,22:30]
[@9,512:523='PK_LOGICAL_t',<ID>,24:0]
[@10,525:525='*',<'*'>,24:13]
[@11,526:530='const',<'const'>,24:14]
[@12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[@13,587:588=');',<');'>,25:0]
[@14,608:607='<EOF>',<EOF>,29:0]

Answers

Lexer rules of the form "read everything except ..." are always difficult to deal with, but you are on the right track.

After commenting out the skip action on TEXT_SEA (TEXT_SEA: .*? PK_LINK ; //-> skip;), I observed that the first function is consumed by a second TEXT_SEA token. Because lexer rules are greedy (the longest match wins), TEXT_SEA gives PK_ERROR no chance to be seen:

$ grun Kernel file -tokens input.txt
line 27:0 token recognition error at: 'more stuff;'
[@0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[@1,42:292=' PK_ERROR_code_t PK_CLASS_ask_superclass\n(\n/* received */\nPK_CLASS_t /*class*/,/* a class */\n/* returned */\nPK_CLASS_t *const /*superclass*/ /* immediate superclass of class */\n);\n\n/*some stuff*/\nblar blar;\n\n\n PK_linkage_m',<TEXT_SEA>,5:12]
[@2,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[@3,310:329='PK_CLASS_is_subclass',<ID>,17:29]

This gave me the idea of using TEXT_SEA both as the "sea consumer" and as the starter of the function mode:

lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant

@lexer::members {
    public static final int WHITESPACE = 1;
}

PK_LINK: 'PK_linkage_m' ;
TEXT_SEA: .*? PK_LINK  -> mode(FUNCTION);
LINE : .*? ( [\r\n] | EOF ) ;

mode FUNCTION;

//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';

PK_ERROR : 'PK_ERROR_code_t' ;
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;

ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';

WS: [ \t\r\n]+ -> channel(HIDDEN) ;

The corresponding parser grammar:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file : ( TEXT_SEA | func_decl | LINE )+;

func_decl
    :   PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK
            {System.out.println("---> Found declaration on line " + $start.getLine() + " `" + $text + "`");}
    ;

param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
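
The longest-match explanation given in this answer can be checked outside ANTLR. What follows is only a rough Python sketch, not ANTLR itself: it imitates a lexer in which the longest match wins and ties go to the rule listed first, using the `re` module. The helper names (`lex`, `declared`), the one-line `SOURCE` input, and the cut-down FUNCTION mode are all assumptions made for illustration, and the LINE rule is left out for brevity.

```python
import re

def lex(text, modes):
    """Toy maximal-munch lexer: at each position the longest match wins and
    ties go to the rule listed first. An action is "skip", None (emit the
    token), or the name of a mode to switch to after emitting the token."""
    pos, mode, tokens = 0, "DEFAULT", []
    while pos < len(text):
        best = None
        for name, pattern, action in modes[mode]:
            m = re.compile(pattern, re.S).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:          # nothing matched; ANTLR would report an error
            pos += 1
            continue
        name, m, action = best
        if action != "skip":
            tokens.append((name, m.group()))
        if action in modes:       # mode switch, e.g. -> mode(FUNCTION)
            mode = action
        pos = m.end()
    return tokens

# Cut-down FUNCTION mode shared by both variants (hypothetical simplification).
FUNCTION = [
    ("PK_ERROR", r"PK_ERROR_code_t", None),
    ("OPEN_BLOCK", r"\(", None),
    ("CLOSE_BLOCK", r"\);", "DEFAULT"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*", None),
    ("WS", r"[ \t\r\n]", "skip"),
]

# Default mode as in the question: TEXT_SEA is skipped, PK_ERROR starts FUNCTION.
ORIGINAL = {
    "DEFAULT": [
        ("PK_ERROR", r"PK_ERROR_code_t", "FUNCTION"),
        ("PK_LINK", r"PK_linkage_m", None),
        ("TEXT_SEA", r".*?PK_linkage_m", "skip"),
        ("TEXT_WS", r"[ \t\r\n]", "skip"),
    ],
    "FUNCTION": FUNCTION,
}

# Default mode as in this answer: TEXT_SEA itself starts FUNCTION.
FIXED = {
    "DEFAULT": [
        ("PK_LINK", r"PK_linkage_m", None),
        ("TEXT_SEA", r".*?PK_linkage_m", "FUNCTION"),
    ],
    "FUNCTION": FUNCTION,
}

SOURCE = ("junk; PK_linkage_m PK_ERROR_code_t f(); "
          "more; PK_linkage_m PK_ERROR_code_t g();")

def declared(modes):
    """Names of the functions whose ID token reaches the token stream."""
    return [text for name, text in lex(SOURCE, modes) if name == "ID"]

print(declared(ORIGINAL))  # ['g']
print(declared(FIXED))     # ['f', 'g']
```

With the original rules, the lexer is still in the default mode after the first PK_linkage_m, so TEXT_SEA matches again and swallows the first declaration up to the second PK_linkage_m. Once TEXT_SEA switches the mode itself, the sea rule is no longer available at that point and both declarations survive.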

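Instead of including .*? at the start of a rule (which I always try to avoid), why not try to match either:

PK_ERROR in the default mode (and switch to another mode, as you are doing now),
or else match a single character and skip it?

Something like this: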

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : . -> skip;

mode FUNCTION;

// the rest of your rules as you have them now

Note that this will also match PK_ERROR_code_t inside an input such as "PK_ERROR_code_t_MU ...", so this would be a safer approach:

lexer grammar KernelLexer;

PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER    : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;

mode FUNCTION;

// the rest of your rules as you have them now
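
The difference between the two OTHER variants comes down to which rule produces the longer match. The snippet below is a small Python sketch of that choice, assuming the lexer picks the longest match and breaks ties in favour of the rule listed first. It is an imitation built on `re`, not ANTLR, and `winner`, `SIMPLE`, and `SAFER` are hypothetical names used only here.

```python
import re

def winner(text, rules):
    """Return (rule name, matched text) for the rule a longest-match lexer
    would pick at the start of text; ties go to the rule listed first."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text, re.S)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    return (best[0], best[1].group()) if best else None

# First variant: OTHER matches exactly one character.
SIMPLE = [("PK_ERROR", r"PK_ERROR_code_t"), ("OTHER", r".")]

# Safer variant: OTHER matches a whole identifier, or else one character.
SAFER = [("PK_ERROR", r"PK_ERROR_code_t"),
         ("OTHER", r"[a-zA-Z_][a-zA-Z_0-9]*|.")]

# A longer identifier that merely starts with PK_ERROR_code_t:
print(winner("PK_ERROR_code_t_MU x;", SIMPLE))  # ('PK_ERROR', 'PK_ERROR_code_t')
print(winner("PK_ERROR_code_t_MU x;", SAFER))   # ('OTHER', 'PK_ERROR_code_t_MU')

# The real return type is still recognised, because a tie goes to PK_ERROR:
print(winner("PK_ERROR_code_t f();", SAFER))    # ('PK_ERROR', 'PK_ERROR_code_t')
```

In the safer variant the identifier alternative of OTHER is longer than PK_ERROR for the input PK_ERROR_code_t_MU, so the whole identifier is skipped instead of being split in two.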

Your parser grammar could then look like this:

parser grammar KernelParser;

options { tokenVocab=KernelLexer; }

file        : func_decl+ EOF;
func_decl   : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl  : type_decl COMMENTED_NAME COMMA?;
type_decl   : CONST? STAR* ID STAR* CONST?;

which causes your example input to be parsed as follows: