Replacing semicolons with newlines in Python code

Nic*_*mer 6 python parsing compilation

I want to parse Python code that uses semicolons (`;`) to separate statements, and produce code that replaces them with newlines (`\n`). For example, from

def main():
    a = "a;b"; return a

I'd like to produce

def main():
    a = "a;b"
    return a

Any hints?

Mar*_*ers 4

Use the `tokenize` module to find `token.OP` tokens whose second element is `;` *. Replace those tokens with a `token.NEWLINE` token.
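A quick way to see why the token stream is the right level to work at (the string literal and its source line here are made up for illustration): a `;` inside a string literal stays part of the `STRING` token, so only statement-separating semicolons show up as `OP` tokens:

```python
import tokenize
from io import StringIO

# Tokenize a line where ';' appears both inside a string and as a separator
source = StringIO('a = "a;b"; b = 2\n')
for tok in tokenize.generate_tokens(source.readline):
    # tok[0] is the token type, tok[1] the token string
    print(tokenize.tok_name[tok[0]], repr(tok[1]))
```

Only the second `;` is emitted as an `OP` token; the one inside `"a;b"` never appears on its own, which is exactly what a naive text replacement would get wrong.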

However, you also need to adjust the token offsets and produce matching indentation. So after a `NEWLINE` you need to adjust line numbers (by an offset that increases for each `NEWLINE` you insert), and the 'next line' (the remainder of the current line) needs its column indices adjusted to match the current indentation level:

import tokenize

TokenInfo = getattr(tokenize, 'TokenInfo', lambda *a: a)  # Python 2 compat (no TokenInfo namedtuple)

def semicolon_to_newline(tokens):
    line_offset = 0
    last_indent = None
    col_offset = None  # None or an integer
    for ttype, tstr, (slno, scol), (elno, ecol), line in tokens:
        slno, elno = slno + line_offset, elno + line_offset
        if ttype in (tokenize.INDENT, tokenize.DEDENT):
            last_indent = ecol  # block is indented to this column
        elif ttype == tokenize.OP and tstr == ';':
            # swap out semicolon with a newline
            ttype = tokenize.NEWLINE
            tstr = '\n'
            line_offset += 1
            if col_offset is not None:
                scol, ecol = scol - col_offset, ecol - col_offset
            col_offset = 0  # next tokens should start at the current indent
        elif col_offset is not None:
            if not col_offset:
                # adjust column by starting column of next token
                col_offset = scol - last_indent
            scol, ecol = scol - col_offset, ecol - col_offset
            if ttype == tokenize.NEWLINE:
                col_offset = None
        yield TokenInfo(
            ttype, tstr, (slno, scol), (elno, ecol), line)

with open(sourcefile, 'r') as source, open(destination, 'w') as dest:
    generator = tokenize.generate_tokens(source.readline)
    dest.write(tokenize.untokenize(semicolon_to_newline(generator)))

Note that I didn't bother to correct the `line` value; it is informational only, and the data read from the file is not actually used when untokenizing.

Demo:

>>> from io import StringIO
>>> source = StringIO('''\
... def main():
...     a = "a;b"; return a
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
def main():
    a = "a;b"
    return a

A slightly more complex example:

>>> source = StringIO('''\
... class Foo(object):
...     def bar(self):
...         a = 10; b = 11; c = 12
...         if self.spam:
...             x = 12; return x
...         x = 15; return y
...
...     def baz(self):
...         return self.bar;
...         # note, nothing after the semicolon
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
class Foo(object):
    def bar(self):
        a = 10
        b = 11
        c = 12
        if self.spam:
            x = 12
            return x
        x = 15
        return y

    def baz(self):
        return self.bar

        # note, nothing after the semicolon

>>> print(result.replace(' ', '.'))
class.Foo(object):
....def.bar(self):
........a.=.10
........b.=.11
........c.=.12
........if.self.spam:
............x.=.12
............return.x
........x.=.15
........return.y

....def.baz(self):
........return.self.bar
........
........#.note,.nothing.after.the.semicolon

* The Python 3 version of `tokenize` outputs more informative `TokenInfo` named tuples, which have an extra `exact_type` attribute that can be used instead of matching the token text: `tok.exact_type == tokenize.SEMI`. I kept the above compatible with both Python 2 and 3, however.
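In Python 3 only, the `elif` test above could therefore be written as a sketch like this (the sample line is made up for illustration):

```python
import tokenize
from io import StringIO

# exact_type distinguishes individual operators, so no string comparison
# against ';' is needed: SEMI is the exact type of the semicolon token.
source = StringIO('x = 12; return x\n')
for tok in tokenize.generate_tokens(source.readline):
    if tok.exact_type == tokenize.SEMI:
        print(tok.start, tok.string)
```

Note that `tok.type` is still the generic `tokenize.OP`; `exact_type` is a finer-grained classification computed from the token string.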