Rav*_*ell 0 python regex ply tokenize lexer
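I am very new to PLY and only a little past beginner level in Python. I am experimenting with PLY-3.4 and Python 2.7 to learn it. Please see the code below. I am trying to create a token QTAG, which is a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', a positive integer, and one or more whitespace characters. For example, valid QTAGs are: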
"Q.11 "
" Q.12 "
"q.13 "
'''
Q.14
'''
Invalid ones are:
"asdf Q.15 "
"Q. 15 "
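As a side check on the description above, here is a minimal sketch that tries the same six strings with the plain re module, outside of PLY. The helper name qtag_number and the capture group are made up for illustration and are not part of the lexer in the question. Note also that [0-9]+ accepts 0, so "positive integer" is only approximated.

```python
import re

# Hypothetical standalone check of the QTAG description; not part of the lexer.
# re.match already anchors at the start of the string, so no '^' is needed.
QTAG = re.compile(r'\s*[Qq]\.([0-9]+)\s+')

def qtag_number(text):
    """Return the integer after 'Q.' if text starts with a QTAG, else None."""
    m = QTAG.match(text)
    return int(m.group(1)) if m else None

samples = ["Q.11 ", " Q.12 ", "q.13 ", "\nQ.14\n", "asdf Q.15 ", "Q. 15 "]
for s in samples:
    print(repr(s), "->", qtag_number(s))
```

The first four strings yield a number and the last two yield None, which is the behaviour the lexer rule is meant to reproduce.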
Here is my code:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()

#VALID inputs
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test('''
Q.14
''')

# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q. 15 ")
The output I get is as follows:
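Note that only the first and third valid inputs are tokenized correctly. I cannot figure out why my other valid inputs are not tokenized properly. In the docstring of t_QTAG:
Replacing '^' with '\A' did not work.
I tried removing '^'. Then all the valid inputs are tokenized, but the second invalid input is tokenized as well.
Thanks in advance for any help!
Thanks
PS: I joined the google-group ply-hack and tried posting there, but I could not post either directly in the forum or by email. I am not sure whether the group is still active. Prof. Beazley has not responded either. Any ideas?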
I finally found the answer myself. Posting it here in case others find it useful.
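As @Tadgh correctly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the t_QTAG regex above can no longer match, and as a result the second valid input is not tokenized. Reading the PLY documentation carefully, I learned that if the order of the token regexes must be preserved, they have to be defined in functions rather than as strings, the way t_ignore is. If strings are used, PLY automatically sorts them from longest to shortest regex and appends them after the functions. I suppose t_ignore is special in that it is somehow applied before anything else. This part is not clearly documented. The workaround is to define a function for a new token, e.g. t_SPACETAB, placed after t_QTAG, that simply returns nothing. With this change, all the valid inputs are tokenized correctly except the triple-quoted one (the multi-line string containing "Q.14"). The invalid inputs, as specified, are not tokenized.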
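The effect of t_ignore on the anchored rule can be imitated with the re module alone. The sketch below is a simplified simulation, not PLY's actual code: it assumes the lexer first skips any characters listed in the ignore string and then tries the rule at the resulting offset. The names IGNORE and first_match are made up for illustration.

```python
import re

IGNORE = " \t"  # same characters as t_ignore in the question

def first_match(pattern, data):
    """Skip leading ignored characters, then try the pattern at that offset."""
    pos = 0
    while pos < len(data) and data[pos] in IGNORE:
        pos += 1
    # pattern.match(data, pos) does not make pos a new "start of string":
    # without MULTILINE, '^' still only matches at index 0.
    m = pattern.match(data, pos)
    return m.group(0) if m else None

anchored = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')
unanchored = re.compile(r'[ \t]*[Qq]\.[0-9]+\s+')

print(repr(first_match(anchored, "Q.11 ")))     # nothing skipped, rule matches
print(repr(first_match(anchored, " Q.12 ")))    # space skipped first, '^' fails
print(repr(first_match(unanchored, " Q.12 ")))  # matches once '^' is dropped
```

This is consistent with what the question reports: the anchored rule works only when the tag sits at the very first character of the input.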
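The multi-line string problem: it turns out that PLY uses the re module internally. In that module, ^ is by default interpreted only at the beginning of the string, not at the beginning of each line. To change this behavior I need to turn on the multi-line flag, which can be done inside the regex with (?m). So, to handle all the valid and invalid strings in my tests properly, the correct regex is: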
r'(?m)^\s*[Qq]\.[0-9]+\s+'
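The difference that (?m) makes can be seen with the re module directly. This is a small sketch under one assumption: that by the time the QTAG rule is tried, the leading newline of the triple-quoted input has already been consumed by another rule, so matching starts at index 1.

```python
import re

data = "\nQ.14\n"  # same shape as the triple-quoted test input

plain = re.compile(r'^\s*[Qq]\.[0-9]+\s+')
multi = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')

# Start matching at index 1, i.e. just after the leading newline.
plain_result = plain.match(data, 1)
multi_result = multi.match(data, 1)

print(plain_result)                  # None: '^' only matches at index 0
print(repr(multi_result.group(0)))   # 'Q.14\n': '^' also matches after '\n'
```

Note that the trailing \s+ also swallows the newline after the number, so a separate newline rule will not see it.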
Here is the corrected code, with a few more tests added:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()

print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test("""
Q.14
""")
q.test("""
qewr
dhdhg
dfhg
Q.15 asda
""")

# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q. 17 ")
Here is the output: