How can I count the occurrences of text in a file?

j0h*_*j0h 19 command-line bash sort uniq

I have a log file sorted by IP address, and I want to find the number of occurrences of each unique IP address. How can I do this with bash? Possibly listing the count next to the IP, e.g.:

5.135.134.16 count: 5
13.57.220.172: count 30
18.206.226 count:2

And so on.

Here is a sample of the log:

5.135.134.16 - - [23/Mar/2019:08:42:54 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "POST /wp-login.php HTTP/1.1" 200 3836 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "POST /wp-login.php HTTP/1.1" 200 3988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:56 -0400] "POST /xmlrpc.php HTTP/1.1" 200 413 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:05 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:06 -0400] "POST /wp-login.php HTTP/1.1" 200 3985 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:07 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:08 -0400] "POST /wp-login.php HTTP/1.1" 200 3833 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:09 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:11 -0400] "POST /wp-login.php HTTP/1.1" 200 3836 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:12 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:15 -0400] "POST /wp-login.php HTTP/1.1" 200 3837 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:17 -0400] "POST /xmlrpc.php HTTP/1.1" 200 413 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.233.99 - - [23/Mar/2019:04:17:45 -0400] "GET / HTTP/1.1" 200 25160 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
18.206.226.75 - - [23/Mar/2019:21:58:07 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "https://www.google.com/url?3a622303df89920683e4421b2cf28977" "Mozilla/5.0 (Windows NT 6.2; rv:33.0) Gecko/20100101 Firefox/33.0"
18.206.226.75 - - [23/Mar/2019:21:58:07 -0400] "POST /wp-login.php HTTP/1.1" 200 3988 "https://www.google.com/url?3a622303df89920683e4421b2cf28977" "Mozilla/5.0 (Windows NT 6.2; rv:33.0) Gecko/20100101 Firefox/33.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"

Mik*_*ora 39

You can use the `cut` and `uniq` tools:

cut -d ' ' -f1 test.txt  | uniq -c
      5 5.135.134.16
      9 13.57.220.172
      1 13.57.233.99
      2 18.206.226.75
      3 18.213.10.181

Explanation:

  • `cut -d ' ' -f1`: extract the first space-delimited field (the IP address)
  • `uniq -c`: collapse adjacent duplicate lines, prefixing each with its occurrence count (this works here because the log is already sorted by IP)

  • You can use `sed`, e.g. `sed -E 's/ *(\S*) *(\S*)/\2 count: \1/'`, to get exactly the output the OP asked for. (6 upvotes)
  • This should be the accepted answer, since desert's reads the file repeatedly and is therefore much slower. And you can easily use `sort file | cut ....` in case you're not sure whether the file is already sorted. (2 upvotes)
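
Combining the pipeline above with the `sed` idea from the comments, one way to get exactly the format the OP asked for could be the following sketch (it assumes the log is in a file named `log` and is already sorted by IP):

```shell
# Extract the first field, count adjacent duplicates, then reshape
# uniq's "  COUNT IP" output into "IP count: COUNT".
cut -d ' ' -f1 log | uniq -c | sed -E 's/ *([0-9]+) (.*)/\2 count: \1/'
```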

ste*_*ver 14

If you don't specifically need the given output format, then I would recommend the already-posted `cut` + `uniq` based answer.

If you really need the given output format, a single-pass way to do it in awk would be

awk '{c[$1]++} END{for(i in c) print i, "count: " c[i]}' log

When the input is already sorted, this is slightly non-ideal, since it unnecessarily stores all the IPs in memory - a better, though more complicated, approach for the pre-sorted case (more directly equivalent to `uniq -c`) would be:

awk '
  NR==1 {last=$1} 
  $1 != last {print last, "count: " c[last]; last = $1} 
  {c[$1]++} 
  END {print last, "count: " c[last]}
'

Ex.

$ awk 'NR==1 {last=$1} $1 != last {print last, "count: " c[last]; last = $1} {c[$1]++} END{print last, "count: " c[last]}' log
5.135.134.16 count: 5
13.57.220.172 count: 9
13.57.233.99 count: 1
18.206.226.75 count: 2
18.213.10.181 count: 3


des*_*ert 13

You can use `grep` and `uniq` to get the list of addresses, loop over them, and `grep` again for the count:

for i in $(<log grep -o '^[^ ]*' | uniq); do
  printf '%s count %d\n' "$i" $(<log grep -c "$i")
done

`grep -o '^[^ ]*'` outputs every character from the beginning of each line (`^`) up to its first space, and `uniq` removes duplicate lines, leaving you with a list of the IP addresses. Thanks to command substitution, the `for` loop iterates over this list, printing the currently processed IP followed by "count" and the count. The latter is computed by `grep -c`, which counts the number of lines with at least one match.

Example run

$ for i in $(<log grep -o '^[^ ]*'|uniq);do printf '%s count %d\n' "$i" $(<log grep -c "$i");done
5.135.134.16 count 5
13.57.220.172 count 9
13.57.233.99 count 1
18.206.226.75 count 2
18.213.10.181 count 3

  • This solution iterates over the input file repeatedly, once per IP address, which will be very slow if the file is large. The other solutions using `uniq -c` or `awk` only need to read the file once. (13 upvotes)
  • I wouldn't call it premature optimization, since the more efficient solution is also simpler, but to each their own. (3 upvotes)
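
If you do want the loop approach, note that `grep -c "$i"` matches the IP anywhere on the line as an unanchored pattern, so a shorter address such as 18.206.226 would also count lines for 18.206.226.75. A minimal sketch of an anchored variant (assuming the log file is named `log`):

```shell
# Anchor each IP at the start of the line and require a following
# space, so a prefix IP cannot also count longer addresses.
for i in $(grep -o '^[^ ]*' log | uniq); do
  printf '%s count %d\n' "$i" "$(grep -c "^$i " log)"
done
```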

pa4*_*080 8

Here is one possible solution:

IN_FILE="file.log"
for IP in $(awk '{print $1}' "$IN_FILE" | sort -u)
do
    echo -en "${IP}\tcount: "
    grep -c "$IP" "$IN_FILE"
done
  • Replace `file.log` with the actual file name.
  • The command-substitution expression `$(awk '{print $1}' "$IN_FILE" | sort -u)` provides a list of the unique values in the first column.
  • Then `grep -c` counts the occurrences of each of these values within the file.

$ IN_FILE="file.log"; for IP in $(awk '{print $1}' "$IN_FILE" | sort -u); do echo -en "${IP}\tcount: "; grep -c "$IP" "$IN_FILE"; done
13.57.220.172   count: 9
13.57.233.99    count: 1
18.206.226.75   count: 2
18.213.10.181   count: 3
5.135.134.16    count: 5


ter*_*don 5

Some Perl:

$ perl -lae '$k{$F[0]}++; }{ print "$_ count: $k{$_}" for keys(%k)' log 
13.57.233.99 count: 1
18.206.226.75 count: 2
13.57.220.172 count: 9
5.135.134.16 count: 5
18.213.10.181 count: 3

This is the same idea as Steeldriver's awk approach, but in Perl. The `-a` causes perl to automatically split each input line into the array `@F`, whose first element (the IP) is `$F[0]`. So, `$k{$F[0]}++` will create the hash `%k` whose keys are the IPs and whose values are the number of times each IP was seen. The `}{` is funky perlspeak for "do the rest at the very end, after processing all input". So, at the end, the script will iterate over the keys of the hash and print the current key (`$_`) along with its value (`$k{$_}`).

And, just so people don't think that perl forces you to write scripts that look like cryptic scribbles, here is the same thing in a less condensed form:

perl -e '
  while (my $line=<STDIN>){
    @fields = split(/ /, $line);
    $ip = $fields[0];
    $counts{$ip}++;
  }
  foreach $ip (keys(%counts)){
    print "$ip count: $counts{$ip}\n"
  }' < log
Run Code Online (Sandbox Code Playgroud)