How do I run grep in parallel?

Lai*_*han 5 linux grep

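I often use grep -rIn pattern_str big_source_code_dir to find things, but grep does not run in parallel. How can I make it parallel? My system has 4 cores, and grep would be faster if it could use all of them.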

Ily*_*lya 11

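If the directory you are searching is stored on an HDD, you will not see a speedup. A hard disk drive is essentially a single-threaded access device.

However, if you really want a parallel grep, this site gives two hints on how to do it with find and xargs. For example: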

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
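The one-liner above hard-codes 4 jobs and drops the file name, line number, and binary-skipping behaviour of the question's grep -rIn. As a minimal sketch, assuming GNU findutils and coreutils (for xargs -P and nproc), it could be wrapped in a small helper. The name par_grep is hypothetical and not from the linked site:

```shell
# Hypothetical helper: recursive grep spread over all available cores.
# Usage: par_grep PATTERN DIR
par_grep() {
    # -print0 / -0 keep file names containing spaces or quotes intact.
    # -P "$(nproc)" runs one grep per available core instead of a fixed 4.
    # -H -n -I mirror `grep -rIn`: print file name and line number, skip binary files.
    find "$2" -type f -print0 |
        xargs -0 -P "$(nproc)" -n 40 grep -H -n -I -e "$1"
}
```

Called as par_grep pattern_str big_source_code_dir. GNU xargs exits with status 123 when any grep batch finds no match, so the exit status is less useful here than with a plain grep.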

  • Note that with "xargs" you may get mixed output. To see this in action, see: http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel (2 upvotes)

Mor*_*tus 5

The GNU parallel command is very useful for this.

sudo apt-get install parallel # if not available on debian based systems

The parallel man page then provides an example:

EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 job per core, and give 1000 arguments to grep.

In your case it might be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
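The command above passes file names separated by newlines. A variant that uses NUL separators (find -print0 together with GNU parallel's -0 option) also copes with file names that contain newlines. This is only a sketch: the helper name par_grep_gnu is hypothetical, and it assumes a pattern without whitespace or shell metacharacters, because GNU parallel hands the command to a shell that parses it again.

```shell
# Hypothetical helper: recursive grep via GNU parallel with NUL-separated file names.
# Usage: par_grep_gnu PATTERN DIR
# Assumption: PATTERN contains no whitespace or shell metacharacters.
par_grep_gnu() {
    # -0 reads NUL-separated input, -k keeps output in input order,
    # -j150% runs 1.5 jobs per core, -n 1000 -m passes up to 1000 files per grep.
    find "$2" -type f -print0 |
        parallel -0 -k -j150% -n 1000 -m grep -H -n -I -e "$1" {}
}
```

Called as par_grep_gnu pattern_str big_source_code_dir.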

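Finally, the GNU parallel man page also has a section describing the differences between the xargs and parallel commands, which should help explain why parallel looks like the better fit in your case: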

DIFFERENCES BETWEEN xargs AND GNU Parallel
       xargs offer some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
       ...

  • I disagree: # time grep -E 'invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' /var/log/auth.log takes 10 seconds on my i7. Then, testing the speed of the drive: # dd if=/var/log/auth.log of=/dev/null bs=1M takes 4 seconds for 600MB at 130MB/s. But the grep above takes about 3 times longer, reading data at close to 40MB/s. So the regex processing is the most expensive part here. Then, running it in parallel: parallel --pipe --block 16M grep -E 'invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' </var/log/auth.log takes 3 seconds instead of 10... (4 upvotes)
  • Parallel grep is very handy for high-latency network mounts. (2 upvotes)