Nic*_*mer · 6 · python · parsing · compilation
I want to parse Python code that contains semicolons (;) separating commands, and produce code that replaces them with newlines (\n). For example, from
def main():
    a = "a;b"; return a
I want to produce
def main():
    a = "a;b"
    return a
Any hints?
Use the tokenize library to find token.OP tokens whose second element is ';'.* Replace those tokens with token.NEWLINE tokens.
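To see what you are matching against, here is a quick sketch (the sample code string and the variable names are mine, just for illustration); each token is a TokenInfo 5-tuple of (type, string, start, end, line):

```python
import io
import tokenize

# A one-line statement with a semicolon separator.
code = 'a = 1; b = 2\n'
tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))

# The statement separator shows up as an OP token whose string is ';'.
semis = [t for t in tokens if t[0] == tokenize.OP and t[1] == ';']
print(semis[0][1], semis[0][2])  # the ';' starts at row 1, column 5
```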
However, you also need to adjust the token offsets and produce matching indentation; so after a NEWLINE you need to adjust line numbers (incrementing an offset for each NEWLINE you insert), and the "next line" (the remainder of the current line) needs its column indices adjusted to match the current indentation level:
import tokenize

TokenInfo = getattr(tokenize, 'TokenInfo', lambda *a: a)  # Python 3 compat

def semicolon_to_newline(tokens):
    line_offset = 0
    last_indent = None
    col_offset = None  # None or an integer
    for ttype, tstr, (slno, scol), (elno, ecol), line in tokens:
        slno, elno = slno + line_offset, elno + line_offset
        if ttype in (tokenize.INDENT, tokenize.DEDENT):
            last_indent = ecol  # block is indented to this column
        elif ttype == tokenize.OP and tstr == ';':
            # swap out semicolon with a newline
            ttype = tokenize.NEWLINE
            tstr = '\n'
            line_offset += 1
            if col_offset is not None:
                scol, ecol = scol - col_offset, ecol - col_offset
            col_offset = 0  # next tokens should start at the current indent
        elif col_offset is not None:
            if not col_offset:
                # adjust column by starting column of next token
                col_offset = scol - last_indent
            scol, ecol = scol - col_offset, ecol - col_offset
        if ttype == tokenize.NEWLINE:
            col_offset = None
        yield TokenInfo(
            ttype, tstr, (slno, scol), (elno, ecol), line)

with open(sourcefile, 'r') as source, open(destination, 'w') as dest:
    generator = tokenize.generate_tokens(source.readline)
    dest.write(tokenize.untokenize(semicolon_to_newline(generator)))
Note that I don't bother to correct the line value; it is informational only, and the data read from the file is not actually used when untokenizing.
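A minimal sketch illustrating that point, assuming Python 3's tokenize.TokenInfo: replacing every token's line attribute with a placeholder does not change the untokenized result.

```python
import io
import tokenize

src = 'x = 1\n'
tokens = tokenize.generate_tokens(io.StringIO(src).readline)

# Replace the informational `line` field with a placeholder; untokenize()
# reconstructs the source from the token strings and start/end positions.
mangled = [tokenize.TokenInfo(t.type, t.string, t.start, t.end, '<ignored>')
           for t in tokens]
result = tokenize.untokenize(mangled)
print(result == src)  # True
```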
Demo:
>>> from io import StringIO
>>> source = StringIO('''\
... def main():
...     a = "a;b"; return a
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
def main():
    a = "a;b"
    return a
Slightly more complex:
>>> source = StringIO('''\
... class Foo(object):
...     def bar(self):
...         a = 10; b = 11; c = 12
...         if self.spam:
...             x = 12; return x
...         x = 15; return y
...
...     def baz(self):
...         return self.bar;
...         # note, nothing after the semicolon
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
class Foo(object):
    def bar(self):
        a = 10
        b = 11
        c = 12
        if self.spam:
            x = 12
            return x
        x = 15
        return y

    def baz(self):
        return self.bar

        # note, nothing after the semicolon
>>> print(result.replace(' ', '.'))
class.Foo(object):
....def.bar(self):
........a.=.10
........b.=.11
........c.=.12
........if.self.spam:
............x.=.12
............return.x
........x.=.15
........return.y

....def.baz(self):
........return.self.bar
........
........#.note,.nothing.after.the.semicolon
* The Python 3 version of tokenize outputs more informative TokenInfo named tuples, which have an extra exact_type attribute that can be used instead of matching on the text: tok.exact_type == tokenize.SEMI. I kept the above compatible with both Python 2 and 3, however.
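Under Python 3, that exact_type match could look like this (a small sketch; the sample input string is mine):

```python
import io
import tokenize

# exact_type identifies the specific operator, so the ';' inside the
# string literal (part of a STRING token) is not matched; only the real
# statement separator is.
code = 'a = "a;b"; b = 2\n'
tokens = tokenize.generate_tokens(io.StringIO(code).readline)
semis = [tok for tok in tokens if tok.exact_type == tokenize.SEMI]
print(len(semis), semis[0].start)  # 1 (1, 9)
```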