Python snippet to remove C and C++ comments

Question

Tom*_*omZ 41 c c++ python regex comments

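I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

I realize that I could .match() substrings with a regex, but that doesn't solve nested /*, or having a // inside a /* */.

Ideally, I would prefer a non-naive implementation that properly handles awkward cases.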

Answer 1

Mar*_*rot 81

This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " " # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

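Strings need to be included, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags, so the pattern had to be compiled first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5;, which would not compile, by replacing the comment with a space rather than an empty string.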
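
As a quick check of the cases listed above, the following sketch restates the same pattern so that it runs on its own and feeds it a few awkward inputs. The sample C lines and the name `SAMPLES` are made up for illustration and are not part of the original post.

```python
import re

# Same pattern as in the answer, restated so this snippet is self-contained.
PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        # Comments start with '/'; string and character literals are kept.
        return " " if s.startswith('/') else s
    return PATTERN.sub(replacer, text)

# Illustrative inputs (assumptions, not from the original post).
SAMPLES = [
    'int/**/x=5;',                       # comment used as a token separator
    'char *s = "/* not a comment */";',  # comment markers inside a string
    "char q = '\"'; // trailing",        # quote inside a character literal
]

for sample in SAMPLES:
    print(repr(comment_remover(sample)))
```

The first sample keeps `int` and `x` apart, the second comes back unchanged, and in the third only the trailing `//` comment is replaced.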

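You can also preserve line numbering relative to the input file by changing the first return to: return "" + "\n" * s.count('\n'). I needed to do this in my situation. (2 upvotes)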

Answer 2

Kon*_*lph 25

C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the "single line" flag (re.S) because a C comment can span multiple lines.

import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

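This code should work.

/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily: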

//.*?(\r\n?|\n)|/\*.*?\*/

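This regular expression should work on all text files, regardless of their line endings (it covers Windows, Unix and Mac line endings).

/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well, but it only handles double-quoted strings.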
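
To make the remark about strings concrete, here is a small sketch of how the comment-only regex damages a string literal that happens to contain `//`. The input line is a made-up example.

```python
import re

def stripcomments(text):
    # Comment-only regex from this answer: it knows nothing about string literals.
    return re.sub(r'//.*?(\r\n?|\n)|/\*.*?\*/', '', text, flags=re.S)

# Illustrative input: the "//" inside the URL is not a comment.
line = 'const char *url = "http://example.com"; // homepage\n'
print(repr(stripcomments(line)))
```

The string literal is cut off after `http:` and the line ending is swallowed as well, which is why string and character literals have to be matched alongside the comments.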

Use `$` and re.MULTILINE instead of '\n', '\r\n', etc. (3 upvotes)

Answer 3

Jon*_*ler 6

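Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...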

" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"

"And escaped double quotes at the end of a string\""

aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C fla
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++  comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

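This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line, but the line splicing doesn't care how many there are, though subsequent processing might. And so on. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).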
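
One way to cope with these cases from Python is to imitate the early translation phases before any comment regex runs. The sketch below is a simplification under stated assumptions: `splice_lines` is a made-up helper name, only the `??/` trigraph is handled, line endings are assumed to be `\n`, and removing the spliced newlines shifts line numbers.

```python
def splice_lines(text):
    """Approximate the trigraph and line-splicing steps for the cases above."""
    # Simplified trigraph step: only ??/, which stands for a backslash.
    text = text.replace('??/', '\\')
    # Line splicing: delete every backslash immediately followed by a newline.
    return text.replace('\\\n', '')

# Two of the test cases from the answer, written as Python strings.
cpp_comment = '/\\\n/\\\nC++/C99 comment\nint x;\n'
c_comment = '/\\\n*\\\nRegular\ncomment\n*\\\n/\nint y;\n'

print(repr(splice_lines(cpp_comment)))
print(repr(splice_lines(c_comment)))
```

After splicing, the comment delimiters are contiguous again (`//` and `/* ... */`), so a string-aware comment regex can be applied to the result.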

Answer 4

zvo*_*ase 6

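I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if they are found in a string declaration. From within Python, it can be used with the following code: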

import subprocess

# source_code is a string with the source code.
# sed reads its commands from a script file only when given -f; the source is
# fed to sed's stdin and the result is captured from its stdout.
process = subprocess.run(['sed', '-f', '/path/to/remccoms3.sed'],
    input=source_code, stdout=subprocess.PIPE, universal_newlines=True)
return_code = process.returncode

stripped_code = process.stdout

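In this program, source_code is the variable containing the C/C++ source code, and eventually stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could instead give the process file handles pointing to those files as its input and output (the input opened in read mode, the output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.

This will probably be better than a pure Python solution; no need to reinvent the wheel.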
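
For the file-on-disk case, a minimal sketch could look like the following. The helper name `filter_file` and the file names in the commented call are made up for illustration; the script path is the placeholder used above, and sed is assumed to be on the PATH.

```python
import subprocess

def filter_file(src_path, dst_path, command):
    """Run `command` with src_path as its stdin and dst_path as its stdout."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        # Real file objects can be passed directly as stdin/stdout.
        completed = subprocess.run(command, stdin=src, stdout=dst)
    return completed.returncode

# Hypothetical call; the file names are placeholders:
# filter_file('input.c', 'output.c', ['sed', '-f', '/path/to/remccoms3.sed'])
```

Passing the command as a list avoids going through a shell, and the return code lets the caller detect a failed sed run.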

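Don't introduce an additional script and tool dependency to your Python script by using sed. Choose sed or Python, not both. (25 upvotes)
Opening another process is not good. It is expensive and risky. I suggest sticking with pure Python. (2 upvotes)
It's not Python. It's shell. What if you are on Windows? (2 upvotes)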

Answer 5

Men*_*ngh 6

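This posting provides a coded-out version of an improvement to Markus Jarderot's code that was described by atikat in a comment on Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)

To describe the improvement somewhat more fully: the improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)

This version of the C/C++ comment removal function is suitable when you want to generate error messages for your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).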

import re

def removeCCppComment( text ) :

    def blotOutNonNewlines( strIn ) :  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))

    def replacer( match ) :
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/  ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..."  ==> Keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )

    return re.sub(pattern, replacer, text)
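
One thing to watch with a newline-only replacement: a comment that contains no newline collapses to an empty string, so `int/**/x=5;` turns into `intx=5;` again. A possible variant, offered as a sketch and not as part of the code above, emits one space followed by the comment's newlines. The name `remove_comments_keep_lines` and the sample input are made up.

```python
import re

PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def remove_comments_keep_lines(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            # One space keeps neighbouring tokens apart; newlines keep line numbers.
            return ' ' + '\n' * s.count('\n')
        return s  # string or character literal: keep unchanged
    return PATTERN.sub(replacer, text)

# Illustrative input with a separator comment, a multi-line comment and a // comment.
source = 'int/**/x=5; /* spans\ntwo lines */ int y;\nint z; // done\n'
stripped = remove_comments_keep_lines(source)
print(repr(stripped))
```

The number of newlines in `stripped` equals that in `source`, so line numbers reported against the stripped text still point at the right lines of the original.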

Archived:	17 years, 3 months ago
Viewed:	33671 times
Last active:	6 years, 7 months ago