How to balance rules and terminals in python lark parser?

Question

ach*_*zot 3 python parsing ebnf lark-parser

I'm using lark, an excellent python parsing library.

It provides Earley and LALR(1) parsers, and its grammars are defined in a custom EBNF format (EBNF stands for Extended Backus–Naur form).

Lowercase definitions are rules, uppercase definitions are terminals. Lark also provides a weight for uppercase definitions to prioritize the matching.

I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.

I have some rules with unnamed literals (the strings or characters between double-quotes):

directives: directive+
directive: "@" NAME arguments ?
directive_definition: description? "directive" "@" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"

arguments: "(" argument+ ")"
argument: NAME ":" value

union_type_definition: description? "union" NAME directives? union_member_types?

union_member_types: "=" NAME ("|" NAME)*

description: STRING | LONG_STRING    

STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/

It works well for 99% of use cases. But if, in my parsed language, I use a directive which is called directive, everything breaks:

union Foo @something(test: 42) = Bar | Baz   # This works
union Foo @directive(test: 42) = Bar | Baz   # This fails

Here, the directive string is matched on the unnamed literal in the directive_definition rule when it should match the NAME.2 terminal.

How can I balance/adjust this so that there is no ambiguity for the LALR(1) parser?

Answer 1

Ere*_*rez 6

Author of Lark here.

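This misinterpretation happens because "directive" can be two different tokens: the "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.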
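To make that tie-break concrete, here is a toy lexer written with the standard `re` module. It is only an illustration of the "literal beats NAME" rule described above, not Lark's implementation; the `KEYWORDS` set and the `tokenize` helper are hypothetical names introduced for this sketch.

```python
import re

# Literals that appear in the grammar (assumed subset, for illustration only).
KEYWORDS = {"directive", "union", "on"}
NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def tokenize(text):
    """Context-free lexing: a word equal to a keyword always becomes that keyword."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        match = NAME_RE.match(text, pos)
        if match:
            word = match.group()
            # The more specific literal wins over the generic NAME pattern.
            kind = word.upper() if word in KEYWORDS else "NAME"
            tokens.append((kind, word))
            pos = match.end()
        else:
            # Treat anything else as single-character punctuation.
            tokens.append((ch, ch))
            pos += 1
    return tokens


ok_tokens = tokenize("union Foo @something")
bad_tokens = tokenize("union Foo @directive")
print(ok_tokens)
print(bad_tokens)
```

In the second input the word after "@" comes out as a DIRECTIVE keyword token rather than NAME, which is the token the `directive` rule cannot accept.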

So how can we let the lexer know that @directive is a name, and not just two constant strings?

Solution 1 - Use the contextual lexer

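What would probably help in this situation (it's hard to be sure without the full grammar) is to use the contextual lexer instead of the standard LALR(1) lexer.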

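The contextual lexer can communicate with the parser to some degree, to figure out which terminal makes more sense at each point. This algorithm is unique to Lark, and you can use it like this: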

parser = Lark(grammar, parser="lalr", lexer="contextual")

(This lexer can do everything the standard lexer can do, so in future versions it might become the default lexer.)

Solution 2 - Prefix the terminal

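If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like: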

DIRECTIVE: "@" NAME

Unlike your directive rule, this leaves no ambiguity for the lexer. There is a clear distinction between a directive and the "directive" string (or the NAME terminal).

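And if all else fails, you can always use the Earley parser, which, at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.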

Hope this helps!

Edit: I'd just like to point out that the contextual lexer is now the default for LALR, so it's enough to call:

parser = Lark(grammar, parser="lalr")
