I am new to PLY and only a beginner in Python. I am trying to learn it using PLY-3.4 and Python 2.7; see the code below. I am trying to create a token, QTAG, which is a string of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', followed by a positive integer and one or more whitespace characters. For example, valid QTAGs are
"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''
and invalid ones are
"asdf Q.15 "
"Q.  15 "
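(The intended shape can be sanity-checked with the plain `re` module alone, outside PLY; this is just a sketch of the spec above, not part of the lexer:)

```python
import re

# Sketch of the intended QTAG shape:
# optional leading whitespace, 'Q' or 'q', a dot, a positive
# integer, then one or more trailing whitespace characters.
QTAG = re.compile(r'\s*[Qq]\.[0-9]+\s+\Z')

# Valid examples from above
assert QTAG.match("Q.11 ")
assert QTAG.match("  Q.12 ")
assert QTAG.match("q.13     ")
assert QTAG.match("\n   Q.14 \n")

# Invalid examples from above
assert not QTAG.match("asdf Q.15 ")
assert not QTAG.match("Q.  15 ")
```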
Here is my code:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required.
    tokens = [
        'QTAG',
        'INT'
        ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1, module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print tok
# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test('''
   Q.14 
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q.  15 ")
The output I get is as follows:
"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''
Note that only the first and third valid inputs are tokenized correctly. I cannot figure out why my other valid inputs are not tokenized. If, in the docstring of t_QTAG, I replace '^' with '\A', then all of the valid inputs are tokenized, but the second invalid input gets tokenized as well. Thanks in advance for any help!

Thanks

PS: I joined the ply-hack Google group and tried to post there, but I could not post either on the forum or by e-mail. I am not sure whether the group is still active; Prof. Beazley has not responded either. Any ideas?
Finally I found the answer myself. Posting it here so that others may find it useful.
As @Tadgh correctly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so t_QTAG never gets a chance to match the regex above; that is why the second valid input was not tokenized.

By reading the PLY documentation more carefully, I learned that if you want to preserve the order in which token regexes are tried, they must be defined in functions. Rules given as strings, like t_ignore, are automatically sorted by PLY from longest to shortest regex and appended after the function rules. I suppose t_ignore is special in that it is somehow applied before everything else; this part is not clearly documented.

The way around this is to define a function for a new token, e.g. t_SPACETAB, placed after t_QTAG, that simply returns nothing. With that change, all valid inputs are tokenized correctly except the triple-quoted one (the multi-line string containing "Q.14"). The invalid inputs are, as specified, not tokenized.
The multi-line string problem: it turns out that PLY uses the re module internally, and in that module '^' is by default interpreted only at the beginning of the whole string, not at the beginning of each line. To change this behaviour I had to turn on the multiline flag, which can be done inside the regex itself with (?m). So, to handle all the valid and invalid strings in my tests correctly, the right regex is:
r'(?m)^\s*[Qq]\.[0-9]+\s+' 
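(The effect of (?m) can be seen with the re module directly; a minimal sketch, not PLY-specific:)

```python
import re

# The original anchored pattern: without the multiline flag,
# '^' matches only at the very start of the whole input string.
pattern = r'^\s*[Qq]\.[0-9]+\s+'

text = "qewr\n   Q.15 asda\n"

# No match: '^' cannot anchor after the embedded newline.
assert re.search(pattern, text) is None

# With the inline multiline flag, '^' also matches right after
# each '\n', so the QTAG on the second line is found.
assert re.search('(?m)' + pattern, text) is not None
```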
Here is the corrected code, with a few more tests added:
import ply.lex as lex
class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
        ]
    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t
    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)    
        return t
    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)
    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore  = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"
    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)
    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print tok
# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test("""
   Q.14
""")
q.test("""
qewr
dhdhg
dfhg
   Q.15 asda
""")
# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q.  17 ")