Script to remove Python comments/docstrings

Anu*_*yal 10 python comments

Is there a Python script or tool available which can remove comments and docstrings from Python source?

It should take care of cases like:

"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # fake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m

So finally I came up with a simple script which uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go through all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)
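For reference, the comment-skipping core of the approach above can be reproduced on Python 3 with `io.StringIO` instead of `cStringIO` (a minimal sketch; like the original, it drops comments but leaves docstrings alone):

```python
import io
import tokenize

def remove_comments(src):
    """Drop COMMENT tokens and rebuild the source with untokenize."""
    kept = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type != tokenize.COMMENT:
            kept.append(tok)
    return tokenize.untokenize(kept)

stripped = remove_comments("x = 1  # a comment\ny = 2\n")
# the comment text is gone and the result still compiles
```

Because `untokenize` is given full 5-tuples, it pads with whitespace where the dropped tokens used to be, so the output stays aligned with the original positions.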

Also, I would like to test it on a very large collection of scripts with good unit-test coverage. Can you suggest such an open-source project?

Dan*_*all 19

我是" mygod的作者,他用正则表达式编写了一个python解释器...... "(即pyminifier)在下面的链接中提到=).
我只是想插入并说我使用tokenizer模块(我发现这个问题=)发现了相当多的代码.

您会很高兴地注意到代码不再依赖于正则表达式并使用tokenizer来产生很好的效果.无论如何,这是remove_comments_and_docstrings()来自pyminifier 的函数
(注意:它适用于先前发布的代码中断的边缘情况):

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces.  Newlines that end a statement are
                    # NEWLINE and newlines inside of operators (and blank
                    # lines) are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
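The NEWLINE-vs-NL distinction those comments rely on is easy to see by dumping token names (a small illustration, not part of pyminifier):

```python
import io
import tokenize

src = "x = [\n    1,\n    2,\n]\ny = 3\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
# the newlines inside the brackets are tokenized as NL,
# while the newline that ends each statement is NEWLINE
```

Note also that no INDENT token is emitted for the spaces before `1,` and `2,`, which is exactly the "unlabelled indentation" case the code above uses to detect that it is inside an operator.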


Ned*_*der 8

This does the job:

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.

    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("##\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file(sys.argv[1])

I leave stub comments in the place of docstrings and comments since it simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.

  • This has other problems, too. For example, if a function has *only* a docstring, the result is not syntactically valid. There also seem to be some confusing issues with tab handling (not that anyone should be using tabs). It might be interesting to see a correct version of this idea based on the AST rather than the token stream. (2 upvotes)
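The first problem in that comment is easy to reproduce: once a docstring-only body is stripped outright (no stub comment left behind), the function body is empty and the result no longer compiles:

```python
# what remains of  def f(): """doc"""  after removing the docstring outright
source_without_docstring = "def f():\n"
try:
    compile(source_without_docstring, "<stripped>", "exec")
    syntactically_valid = True
except SyntaxError:  # "expected an indented block"
    syntactically_valid = False
```

This is why a stripper that deletes docstrings completely has to insert a `pass` (or a stub comment, as above) when the docstring was the only statement in the body.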

Sur*_*Dog 5

I found an easier way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without comments. I had to strip out the docstrings with a simple match, but it seems to work. I've been looking through the output, and so far the only downside of this method is that it strips all blank lines from your code.

import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)

  • In my opinion this is the only correct way to do it. The `lstrip() ... in` check should be replaced by walking the ast and excluding the docstring nodes. As written, it relies on unparse behaving a certain way and on never having multi-line statements, etc. (2 upvotes)
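The AST-walking approach suggested in that comment can be sketched like this (an illustrative sketch, not the commenter's code; it assumes Python 3.9+ for `ast.unparse`, and it also sidesteps the empty-body problem by substituting `pass`):

```python
import ast

def strip_docstrings(source):
    """Parse, remove docstring nodes, and unparse (comments vanish in parsing)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.ClassDef)):
            body = node.body
            # a docstring is a bare string expression in first position
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                if len(body) > 1:
                    del body[0]
                else:
                    body[0] = ast.Pass()  # keep the body syntactically valid
    return ast.unparse(tree)
```

Because the docstring is removed from the tree rather than matched textually, multi-line strings and string expressions elsewhere in the code are left alone.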

Bas*_*asj 5

Here is a modification of Dan's solution that makes it work with Python 3, also removes empty lines, and makes it ready to use:

import io, tokenize
def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out
with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))
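The extra step compared to Dan's version is the final blank-line filter; in isolation it behaves like this:

```python
text = "a = 1\n\n\nb = 2\n"
# keep only lines that contain something other than whitespace
cleaned = '\n'.join(l for l in text.splitlines() if l.strip())
# blank lines are dropped; the code lines keep their order and indentation
```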