Is it possible to make this shell script faster?

Question

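I have the task of writing a script that takes a huge text file as input. It then needs to find all words and the number of times each one occurs, and create a new file in which each line shows a unique word and its number of occurrences.

As an example, take a file with this content: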

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.   
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum.

I need to create a file that looks like this:

1 AD
1 ADIPISICING
1 ALIQUA
...
1 ALIQUIP
1 DO
2 DOLOR
2 DOLORE
...

For this I wrote a script using tr, sort and uniq:

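#!/bin/sh
INPUT=$1
OUTPUT=$2
# Only run when the input file exists
if [ -e "$INPUT" ]
then
    # Split into one word per line, strip leftovers, upper-case, then count
    tr '[:space:][\-_?!.;\:]' '\n' < "$INPUT" |
        tr -d '[:punct:][:special:][:digit:]' |
        tr '[:lower:]' '[:upper:]' |
        sort |
        uniq -c > "$OUTPUT"
fi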

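What this does is split the text on whitespace. If a word contains any of -_?!.;: I break it into further words at those characters. I remove punctuation, special characters and digits, and convert the whole string to uppercase. Once that is done, I sort it and pass it through uniq to get the format I want.

Then I downloaded the Bible in txt format and used it as the input. Timing the script, I got: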

scripts|$ time ./text-to-word.sh text.txt b     
./text-to-word.sh text.txt b  16.17s user 0.09s system 102% cpu 15.934 total

I did the same thing with a Python script:

import re
from collections import Counter
from itertools import chain
import sys

file = open(sys.argv[1])

c = Counter()

for line in file.readlines():
    c.update([re.sub('[^a-zA-Z]', '', l).upper()
            for l in chain(*[re.split('[-_?!.;:]', word)
                    for word in line.split()])])

file2 = open('output.txt', 'w')
for key in sorted(c):
    file2.write(key + ' ' + str(c[key]) + '\n')

When I executed the script I got:

scripts|$ time python text-to-word.py text.txt
python text-to-word.py text.txt  7.23s user 0.04s system 97% cpu 7.456 total

As you can see, it ran in 7.23s, compared to 16.17s for the shell script. I have tried bigger files and Python always seems to win. I have a few questions about the scenario above:

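1. Why is the Python script faster, given that the shell commands are written in C? I do realize the shell script may not be the optimal one.
2. How can I improve the shell script?
3. Can I improve the Python script?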

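To be clear, I am not trying to compare Python to shell scripts. I am not trying to start a flame war, and I do not need answers in some other language that claim to be faster. Following the UNIX philosophy of piping small commands together to get a task done, how do I make the shell script faster?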

Answer 1

Aar*_*lla 7

An important point here is probably inter-process I/O. The Python script has all the data in memory, so no I/O happens while it processes the data.

Also note that Python isn't slow as such. Most of the functionality in Python is implemented in C.

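The shell script has to start 5 processes, and each of them has to read the whole text from stdin and write it to stdout again, so the whole text passes through a pipe four times.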

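There might be a way to make the Python script a bit faster: you can read the whole text into a single string, then remove all punctuation, split it into words and count them: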

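import re
import sys
from collections import Counter

# Read the whole file into a single string
with open(sys.argv[1]) as f:
    text = f.read()
# '-' comes first in the class so it is a literal, not a range like ;-_
text = re.sub(r'[-.,:;_]', '', text)
text = text.upper()
words = text.split()  # split on runs of whitespace
c = Counter(words)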

That avoids the overhead of several nested loops.
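
Put together as a complete script, the idea could look roughly like the sketch below. It is not the asker's code: it assumes the splitting rules described in the question (whitespace and `-_?!.;:` separate words, any other non-letter is dropped), it writes `count WORD` lines in the style of `uniq -c`, and unlike the shell pipeline it skips tokens that end up empty. The names `count_words` and `write_counts` are made up for illustration, and no timing is claimed for it.

```python
import re
import sys
from collections import Counter

# Characters that separate words, as in the question: whitespace plus -_?!.;:
SEPARATORS = re.compile(r'[\s\-_?!.;:]+')
# Whatever is left that is not an ASCII capital letter is dropped (digits, commas, ...)
NON_LETTERS = re.compile(r'[^A-Z]')


def count_words(text):
    """Count upper-cased words in one pass over the whole text."""
    tokens = SEPARATORS.split(text.upper())
    cleaned = (NON_LETTERS.sub('', token) for token in tokens)
    # Skip tokens that end up empty, e.g. a number standing on its own
    return Counter(word for word in cleaned if word)


def write_counts(counts, out):
    """Write one 'count WORD' line per unique word, sorted by word."""
    for word in sorted(counts):
        out.write('%d %s\n' % (counts[word], word))


# Intended usage (hypothetical file name): python wordcount.py INPUT OUTPUT
if __name__ == '__main__' and len(sys.argv) == 3:
    with open(sys.argv[1]) as source:
        counts = count_words(source.read())
    with open(sys.argv[2], 'w') as target:
        write_counts(counts, target)
```

Compiling the two patterns once and running them over the whole string keeps the per-word work inside the `re` module instead of in nested Python loops, which is the point made above.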

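As for the shell script: you should try to reduce the number of processes. The three tr processes could probably be replaced with a single call to sed.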
