Why is egrep [wW][oO][rR][dD] faster than grep -i word?

Question

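I use grep -i quite often, and I noticed that it is slower than the equivalent egrep command in which I match the upper- or lowercase form of each letter: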

$ time grep -iq "thats" testfile

real    0m0.041s
user    0m0.038s
sys     0m0.003s
$ time egrep -q "[tT][hH][aA][tT][sS]" testfile

real    0m0.010s
user    0m0.003s
sys     0m0.006s

Does grep -i perform additional tests that egrep does not?

Answer 1

Gil*_*il' 70

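grep -i 'a' is equivalent to grep '[Aa]' in an ASCII-only locale. In a Unicode locale, character equivalences and case conversions can be complex, so grep may have to do extra work to determine which characters are equivalent. The relevant locale setting is LC_CTYPE, which determines how bytes are interpreted as characters.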
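
To make the equivalence concrete, here is a minimal sketch that checks both forms select the same lines under LC_ALL=C. The sample file and its contents are made up for illustration, and grep -E stands in for egrep because recent GNU grep releases warn that egrep is obsolescent.

```shell
#!/bin/sh
# Hypothetical sample data; the file name and contents are made up.
sample=$(mktemp)
printf '%s\n' 'Thats one' 'THATS two' 'thats three' 'other line' > "$sample"

# Count matching lines with each form in the ASCII-only C locale.
count_i=$(LC_ALL=C grep -ic 'thats' "$sample")
count_class=$(LC_ALL=C grep -Ec '[tT][hH][aA][tT][sS]' "$sample")

echo "grep -i: $count_i, bracket form: $count_class"
rm -f "$sample"
```

This only checks that the two commands agree on what they match; it says nothing about their relative speed.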

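In my experience, GNU grep can be slow when invoked in a UTF-8 locale. If you know that you are only searching for ASCII characters, invoking it in an ASCII-only locale may be faster. I expect that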

time LC_ALL=C grep -iq "thats" testfile
time LC_ALL=C egrep -q "[tT][hH][aA][tT][sS]" testfile

would produce indistinguishable timings.

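That being said, I can't reproduce your finding with GNU grep on Debian jessie (but you didn't specify your test file). If I set an ASCII locale (LC_ALL=C), grep -i is faster. The effect depends on the exact nature of the string; for example, a string with repeated characters reduces performance (which is to be expected).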

Answer 2

mur*_*uru 15

Out of curiosity, I tested this on an Arch Linux system:

$ uname -r
4.4.5-1-ARCH
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  720K  3.9G   1% /tmp
$ dd if=/dev/urandom bs=1M count=1K | base64 > foo
$ df -h .                                         
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  1.4G  2.6G  35% /tmp
$ for i in {1..100}; do /usr/bin/time -f '%e' -ao grep.log grep -iq foobar foo; done
$ for i in {1..100}; do /usr/bin/time -f '%e' -ao egrep.log egrep -q '[fF][oO][oO][bB][aA][rR]' foo; done

$ grep --version
grep (GNU grep) 2.23
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

Then some statistics, courtesy of "Is there a way to get the min, max, median, and average of a list of numbers in a single command?":

$ R -q -e "x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])"
> x <- read.csv('grep.log', header = F); summary(x); sd(x[ , 1])
       V1       
 Min.   :1.330  
 1st Qu.:1.347  
 Median :1.360  
 Mean   :1.362  
 3rd Qu.:1.370  
 Max.   :1.440  
[1] 0.02322725
> 
> 
$ R -q -e "x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])"
> x <- read.csv('egrep.log', header = F); summary(x); sd(x[ , 1])
       V1       
 Min.   :1.330  
 1st Qu.:1.340  
 Median :1.360  
 Mean   :1.365  
 3rd Qu.:1.380  
 Max.   :1.430  
[1] 0.02320288
> 
>

I'm in the en_GB.utf8 locale, but the timings are nearly indistinguishable.
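
If R is not available, a similar summary of a timing log can be sketched with sort and awk. This is only an illustration: summarize is a hypothetical helper name, it assumes one number per line (the format /usr/bin/time -f '%e' writes), and the demo values are made up rather than taken from the measurements above.

```shell
#!/bin/sh
# summarize FILE: print min, median, mean and max of a one-number-per-line file.
# "summarize" is a hypothetical helper, not part of grep, time, or R.
summarize() {
    sort -n "$1" | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            # Input is sorted, so the median is the middle value
            # (or the mean of the two middle values for an even count).
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "min=%.3f median=%.3f mean=%.3f max=%.3f\n", v[1], median, sum / NR, v[NR]
        }'
}

# Demo with made-up timings.
demo=$(mktemp)
printf '%s\n' 1.36 1.33 1.44 1.35 > "$demo"
summarize "$demo"
rm -f "$demo"
```

Running summarize grep.log and summarize egrep.log would then give the two sets of figures side by side, without quartiles or the standard deviation that the R command reports.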
