Replacing semicolons with newlines in Python code

Nic*_*mer 6 python parsing compilation

I want to parse Python code that uses semicolons (`;`) to separate statements, and produce code that replaces them with newlines (`\n`). For example, from

def main():
    a = "a;b"; return a

I'd like to produce

def main():
    a = "a;b"
    return a

Any hints?

Mar*_*ers 4

Use the `tokenize` module to find `token.OP` tokens whose second element is `;` *. Replace those tokens with a `token.NEWLINE` token.
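A quick way to see why the token stream is the right level to work at (the string literal and its source line here are made up for illustration): a `;` inside a string literal stays part of the `STRING` token, so only statement-separating semicolons show up as `OP` tokens:

```python
import tokenize
from io import StringIO

# Tokenize a line where ';' appears both inside a string and as a separator
source = StringIO('a = "a;b"; b = 2\n')
for tok in tokenize.generate_tokens(source.readline):
    # tok[0] is the token type, tok[1] the token string
    print(tokenize.tok_name[tok[0]], repr(tok[1]))
```

Only the second `;` is emitted as an `OP` token; the one inside `"a;b"` never appears on its own, which is exactly what a naive text replacement would get wrong.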

However, you also need to adjust the token offsets and produce matching indentation. So after a `NEWLINE` you need to adjust line numbers (by an offset that increases for each `NEWLINE` you insert), and the 'next line' (the remainder of the current line) needs its column indices adjusted to match the current indentation level:

import tokenize

TokenInfo = getattr(tokenize, 'TokenInfo', lambda *a: a)  # Python 2 compat (no TokenInfo namedtuple)

def semicolon_to_newline(tokens):
    line_offset = 0
    last_indent = None
    col_offset = None  # None or an integer
    for ttype, tstr, (slno, scol), (elno, ecol), line in tokens:
        slno, elno = slno + line_offset, elno + line_offset
        if ttype in (tokenize.INDENT, tokenize.DEDENT):
            last_indent = ecol  # block is indented to this column
        elif ttype == tokenize.OP and tstr == ';':
            # swap out semicolon with a newline
            ttype = tokenize.NEWLINE
            tstr = '\n'
            line_offset += 1
            if col_offset is not None:
                scol, ecol = scol - col_offset, ecol - col_offset
            col_offset = 0  # next tokens should start at the current indent
        elif col_offset is not None:
            if not col_offset:
                # adjust column by starting column of next token
                col_offset = scol - last_indent
            scol, ecol = scol - col_offset, ecol - col_offset
            if ttype == tokenize.NEWLINE:
                col_offset = None
        yield TokenInfo(
            ttype, tstr, (slno, scol), (elno, ecol), line)

with open(sourcefile, 'r') as source, open(destination, 'w') as dest:
    generator = tokenize.generate_tokens(source.readline)
    dest.write(tokenize.untokenize(semicolon_to_newline(generator)))

Note that I didn't bother to correct the `line` value; it is informational only, and the data read from the file is not actually used when untokenizing.

Demo:

>>> from io import StringIO
>>> source = StringIO('''\
... def main():
...     a = "a;b"; return a
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
def main():
    a = "a;b"
    return a

A slightly more complex example:

>>> source = StringIO('''\
... class Foo(object):
...     def bar(self):
...         a = 10; b = 11; c = 12
...         if self.spam:
...             x = 12; return x
...         x = 15; return y
...
...     def baz(self):
...         return self.bar;
...         # note, nothing after the semicolon
... ''')
>>> generator = tokenize.generate_tokens(source.readline)
>>> result = tokenize.untokenize(semicolon_to_newline(generator))
>>> print(result)
class Foo(object):
    def bar(self):
        a = 10
        b = 11
        c = 12
        if self.spam:
            x = 12
            return x
        x = 15
        return y

    def baz(self):
        return self.bar

        # note, nothing after the semicolon

>>> print(result.replace(' ', '.'))
class.Foo(object):
....def.bar(self):
........a.=.10
........b.=.11
........c.=.12
........if.self.spam:
............x.=.12
............return.x
........x.=.15
........return.y

....def.baz(self):
........return.self.bar
........
........#.note,.nothing.after.the.semicolon

* The Python 3 version of `tokenize` outputs more informative `TokenInfo` named tuples, which have an extra `exact_type` attribute that can be used instead of matching the token text: `tok.exact_type == tokenize.SEMI`. I kept the above compatible with both Python 2 and 3, however.
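In Python 3 only, the `elif` test above could therefore be written as a sketch like this (the sample line is made up for illustration):

```python
import tokenize
from io import StringIO

# exact_type distinguishes individual operators, so no string comparison
# against ';' is needed: SEMI is the exact type of the semicolon token.
source = StringIO('x = 12; return x\n')
for tok in tokenize.generate_tokens(source.readline):
    if tok.exact_type == tokenize.SEMI:
        print(tok.start, tok.string)
```

Note that `tok.type` is still the generic `tokenize.OP`; `exact_type` is a finer-grained classification computed from the token string.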