I am new to PLY and only a beginner in Python. I am trying to learn it using PLY-3.4 and Python 2.7; see the code below. I am trying to create a token, QTAG, which is a string of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', followed by a positive integer and one or more whitespace characters. For example, valid QTAGs are
"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''
and invalid ones are
"asdf Q.15 "
"Q.  15 "
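(The intended shape can be sanity-checked with the plain `re` module alone, outside PLY; this is just a sketch of the spec above, not part of the lexer:)

```python
import re

# Sketch of the intended QTAG shape:
# optional leading whitespace, 'Q' or 'q', a dot, a positive
# integer, then one or more trailing whitespace characters.
QTAG = re.compile(r'\s*[Qq]\.[0-9]+\s+\Z')

# Valid examples from above
assert QTAG.match("Q.11 ")
assert QTAG.match("  Q.12 ")
assert QTAG.match("q.13     ")
assert QTAG.match("\n   Q.14 \n")

# Invalid examples from above
assert not QTAG.match("asdf Q.15 ")
assert not QTAG.match("Q.  15 ")
```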
Here is my code:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required.
    tokens = [
        'QTAG',
        'INT'
        ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1, module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print tok
# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test('''
   Q.14 
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q.  15 ")
The output I get is as follows:
"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''
Note that only the first and third valid inputs are tokenized correctly. I cannot figure out why my other valid inputs are not tokenized. If, in the docstring of t_QTAG, I replace '^' with '\A', then all of the valid inputs are tokenized, but the second invalid input gets tokenized as well. Thanks in advance for any help!

Thanks

PS: I joined the ply-hack Google group and tried to post there, but I could not post either on the forum or by e-mail. I am not sure whether the group is still active; Prof. Beazley has not responded either. Any ideas?
Finally I found the answer myself. Posting it here so that others may find it useful.
As @Tadgh correctly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so t_QTAG never gets a chance to match the regex above; that is why the second valid input was not tokenized.

By reading the PLY documentation more carefully, I learned that if you want to preserve the order in which token regexes are tried, they must be defined in functions. Rules given as strings, like t_ignore, are automatically sorted by PLY from longest to shortest regex and appended after the function rules. I suppose t_ignore is special in that it is somehow applied before everything else; this part is not clearly documented.

The way around this is to define a function for a new token, e.g. t_SPACETAB, placed after t_QTAG, that simply returns nothing. With that change, all valid inputs are tokenized correctly except the triple-quoted one (the multi-line string containing "Q.14"). The invalid inputs are, as specified, not tokenized.
The multi-line string problem: it turns out that PLY uses the re module internally, and in that module '^' is by default interpreted only at the beginning of the whole string, not at the beginning of each line. To change this behaviour I had to turn on the multiline flag, which can be done inside the regex itself with (?m). So, to handle all the valid and invalid strings in my tests correctly, the right regex is:
r'(?m)^\s*[Qq]\.[0-9]+\s+' 
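(The effect of (?m) can be seen with the re module directly; a minimal sketch, not PLY-specific:)

```python
import re

# The original anchored pattern: without the multiline flag,
# '^' matches only at the very start of the whole input string.
pattern = r'^\s*[Qq]\.[0-9]+\s+'

text = "qewr\n   Q.15 asda\n"

# No match: '^' cannot anchor after the embedded newline.
assert re.search(pattern, text) is None

# With the inline multiline flag, '^' also matches right after
# each '\n', so the QTAG on the second line is found.
assert re.search('(?m)' + pattern, text) is not None
```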
Here is the corrected code, with a few more tests added:
import ply.lex as lex
class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
        ]
    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t
    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)    
        return t
    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)
    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore  = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"
    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)
    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print tok
# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test("""
   Q.14
""")
q.test("""
qewr
dhdhg
dfhg
   Q.15 asda
""")
# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q.  17 ")