ach*_*zot 3 python parsing ebnf lark-parser
I'm using Lark, an excellent Python parsing library.
It provides Earley and LALR(1) parsers, and its grammars are defined in a custom EBNF format (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also lets you attach a priority to a terminal (for example NAME.2) to influence which terminal gets matched.
I'm trying to define a grammar, but I'm stuck on a behavior I can't work around.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "@" NAME arguments?
directive_definition: description? "directive" "@" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use cases. But if, in the language I'm parsing, I use a directive that is itself named directive, everything breaks:
union Foo @something(test: 42) = Bar | Baz # This works
union Foo @directive(test: 42) = Bar | Baz # This fails
Here, the string directive is matched against the unnamed literal "directive" from the directive_definition rule, when it should match the NAME.2 terminal.
How can I balance or adjust this so that there is no ambiguity for the LALR(1) parser?
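Author of Lark here.
This misinterpretation happens because "directive" can be two different tokens: the "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.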
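To make the collision concrete, here is a toy sketch in plain Python. It is not Lark's implementation: the `KEYWORDS` set, the `TOKEN_RE` pattern and the `lex` function are made up for illustration. It only mimics a lexer that has no knowledge of the parser state, so every word that equals a keyword becomes that keyword, wherever it appears.

```python
import re

# Hypothetical keyword set, taken from the string literals in the grammar.
KEYWORDS = {"directive", "union", "on"}

# Hypothetical token pattern covering only what this demo needs.
TOKEN_RE = re.compile(r"(?P<NAME>[_A-Za-z][_0-9A-Za-z]*)|(?P<AT>@)|(?P<WS>\s+)")


def lex(text):
    """Context-free toy lexer: a word that equals a keyword is always
    emitted as that keyword, regardless of what came before it."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "WS":
            continue  # whitespace is ignored
        if kind == "NAME" and value in KEYWORDS:
            kind = value.upper()  # the "more specific" string wins
        tokens.append((kind, value))
    return tokens


print(lex("union Foo @something"))
# [('UNION', 'union'), ('NAME', 'Foo'), ('AT', '@'), ('NAME', 'something')]
print(lex("union Foo @directive"))
# [('UNION', 'union'), ('NAME', 'Foo'), ('AT', '@'), ('DIRECTIVE', 'directive')]
```

In the second token stream the parser receives a DIRECTIVE keyword right after "@", where the directive rule expects a NAME, so the parse is rejected.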
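So how can we let the lexer know that @directive is a name, and not just two constant strings?
Solution 1 - Use the contextual lexer
What would probably help in this situation (it's hard to be sure without the full grammar) is to use the contextual lexer instead of the standard LALR(1) lexer.
The contextual lexer can communicate with the parser to some degree, in order to figure out which terminal makes more sense at each point. This algorithm is unique to Lark, and you can use it like this: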
parser = Lark(grammar, parser="lalr", lexer="contextual")
(This lexer can do anything the standard lexer can do and more, so in future versions it might become the default lexer.)
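Solution 2 - Prefix the terminal
If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like: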
DIRECTIVE: "@" NAME
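Unlike your directive rule, this leaves no ambiguity for the lexer. There is a clear distinction between a directive and the "directive" string (or the NAME terminal).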
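And if all else fails, you can always use the Earley parser, which, at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.
Hope this helps!
Edit: I'd just like to point out that the contextual lexer is now the default for LALR, so it's enough to call: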
parser = Lark(grammar, parser="lalr")