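I am looking for a way to use Python to get rid of needless newlines in text such as the plain-text files from Project Gutenberg, which are formatted with a newline every 70 characters or so. In Tcl, I could do a simple string map, like this: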
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
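This keeps paragraphs that are separated by two newlines (or by a newline and a tab) apart, joins lines that end in a single newline (substituting a space for it), and drops superfluous CRs. Since Python doesn't have string map, I haven't yet been able to find the most efficient way to dump all the needless newlines, although I'm pretty sure it is not simply to search for each newline in turn and replace it with a space. If all else fails I could evaluate the Tcl expression from Python, but I'd like to find the best Pythonic way to do the same thing. Can some Python connoisseur help me out?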
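The nearest equivalent to the tcl string map would be str.translate, but unfortunately it can only map single characters. So it is necessary to use a regular expression to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r characters have to be removed first: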
import re

oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.

This would keep paragraphs separated.
\tThis would keep paragraphs separated.

\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

Output:

This would keep paragraphs separated. This would keep paragraphs separated.

This would keep paragraphs separated.
	This would keep paragraphs separated.

When, in the course of human events, it becomes necessary for one people
However, I doubt whether this is as efficient as the tcl code.
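For readers who want something closer in shape to string map itself, the following is a minimal sketch of an ordered, single-pass replacement built on a regex alternation. The helper name `string_map` and the sample text are hypothetical (neither comes from the standard library nor from the question), keys are assumed to be non-empty, and no claim is made about its speed relative to the look-around version.

```python
import re

def string_map(pairs, text):
    """Apply (key, replacement) pairs in one left-to-right pass.

    Hypothetical helper, not a standard-library function.
    Keys are assumed to be non-empty strings.
    """
    lookup = {}
    for key, replacement in pairs:
        lookup.setdefault(key, replacement)  # first occurrence of a key wins
    # At each position the alternation tries the keys in list order,
    # so earlier pairs take precedence over later ones.
    pattern = re.compile('|'.join(re.escape(key) for key, _ in pairs))
    return pattern.sub(lambda match: lookup[match.group(0)], text)

# Same pairs as the Tcl call in the question.
pairs = [('\r', ''), ('\n\n', '\n\n'), ('\n\t', '\n\t'), ('\n', ' ')]
sample = 'When, in the course\r\nof human events,\n\nit becomes\n\tnecessary\n'
newtext = string_map(pairs, sample)
print(repr(newtext))
```

Mapping '\n\n' and '\n\t' to themselves is what protects paragraph breaks: those keys consume the newline before the final ('\n', ' ') pair gets a chance to match it.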
UPDATE:
I did a little testing using the Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re
with open('gutenberg.txt') as stream:
    oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
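One plausible source of those extra spaces is a run of three or more newlines. An ordered single-pass replacement consumes the first two as a paragraph break and then turns the third into a space, whereas the look-behind leaves every newline in the run alone. The sketch below illustrates this with a made-up sample; the single-pass part is a pure-Python emulation and an assumption about how string map scans, not a run of the Tcl script.

```python
import re

# Made-up sample: a heading followed by two blank-line newlines plus one more.
sample = 'Chapter I\n\n\nIt was a\nlong day.\n'

# Emulated ordered single-pass replacement (assumption: keys are tried in
# list order at each position and matched text is never rescanned).
pairs = [('\r', ''), ('\n\n', '\n\n'), ('\n\t', '\n\t'), ('\n', ' ')]
lookup = dict(pairs)
pattern = re.compile('|'.join(re.escape(key) for key, _ in pairs))
single_pass = pattern.sub(lambda match: lookup[match.group(0)], sample)

# Look-around version from the answer.
look_around = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', sample.replace('\r', ''))

print(repr(single_pass))   # third newline becomes a leading space
print(repr(look_around))   # all three newlines are preserved
```

If that is indeed the cause, the two outputs differ only where the source has more than one consecutive blank line.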