Working with a 20 GB file in bash

Question

Zac*_*ack 3 regex unix bash performance grep

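A question about code performance: I'm trying to run ~25 regex rules against a ~20 GB text file. The script should write the matches to text files, with each regex rule producing its own file. See the pseudocode below: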

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    # Each $line in the rules file holds a complete grep command, e.g.,
    # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
    while IFS= read -r line; do
        # Quote the file names so spaces in them survive the eval.
        cmd="$line \"$tmp\" > ~/outputdir/\"$tmp.$rname.filter.piped\" &"
        eval "$cmd"
    done < "$regex_rules"
done
wait  # block until every background grep has finished

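A couple of thoughts:

Is there a way to loop over the text file only once, evaluating all the rules and splitting the output into the individual files in one pass? Would that be faster?
Should I be using a different tool for this job?

Thanks.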

Answer 1

Ala*_*rry 5

This is why grep has the -f option. Reduce your regexrulefile.txt to just the regexes, one per line, then run

egrep -f regexrulefile.txt the_big_file

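This produces all the matches in a single output stream, but you can run your loop over that output afterwards to separate them by rule. Assuming the combined list of matches isn't huge, this will be a performance win.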
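
The two-stage idea above could look roughly like the following sketch. It is not the asker's actual script: the function name split_matches, the .filter suffix, and the tab-separated rules format (rule name, a tab, then the extended regex) are all assumptions made for illustration. The -i flag is carried over from the example rule in the question.

```shell
# Sketch only. Assumed (hypothetical) rules format, one rule per line:
#   <rule_name><TAB><extended regex>
split_matches() {
    local rules="$1"      # rules file in the assumed format
    local big_file="$2"   # the large input file
    local out_dir="$3"    # directory that receives one file per rule
    local tab patterns combined name regex

    tab=$(printf '\t')
    patterns=$(mktemp) || return 1
    combined=$(mktemp) || return 1

    # Keep only the regex column so grep -f sees one pattern per line.
    # -s drops lines without a tab; an empty pattern would match every line.
    cut -s -f2- "$rules" > "$patterns"

    # Stage 1: a single scan of the big file with all patterns at once.
    # grep exits 1 when nothing matches; "|| true" ignores that
    # (it also hides real grep errors, which a fuller script should check).
    grep -E -i -f "$patterns" "$big_file" > "$combined" || true

    mkdir -p "$out_dir"

    # Stage 2: re-filter the (hopefully small) combined output per rule.
    while IFS=$tab read -r name regex || [ -n "$name" ]; do
        [ -n "$name" ] || continue
        grep -E -i -e "$regex" "$combined" > "$out_dir/$name.filter" || true
    done < "$rules"

    rm -f "$patterns" "$combined"
}
```

It would be called as, for example, split_matches rules.tsv the_big_file ~/outputdir. Only the first grep reads the 20 GB file; the per-rule greps read the combined matches, so the second stage stays cheap as long as that intermediate file is small, which is the assumption the answer states.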
