How do I count duplicated last columns without removing them?

Lin*_*Lin 5 command-line shell text-processing

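I have a file with 4 columns. I want to compare the last three columns and count how many times each combination occurs, without deleting any lines. I just want the count to appear at the front of each line.

My file looks like this: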

ID-jacob  4.0  6.0  42.0  
ID-elsa   5.0  8.0  45.0  
ID-fred   4.0  6.0  42.0  
ID-gerard 6.0  8.0  20.0  
ID-trudy  5.0  8.0  45.0  
ID-tessa  4.0  6.0  42.0

The result I want is:

3 ID-jacob  4.0  6.0  42.0  
2 ID-elsa   5.0  8.0  45.0  
3 ID-fred   4.0  6.0  42.0  
1 ID-gerard 6.0  8.0  20.0  
2 ID-trudy  5.0  8.0  45.0  
3 ID-tessa  4.0  6.0  42.0

I tried using sort and uniq, but that only gives me the first line of each set of duplicates:

cat file | sort -k2,4 | uniq -c -f1 > outputfile

bsd*_*bsd 3

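You may run into trouble holding a large file in memory. This approach is slightly better because it stores only the current run of matching lines, after sort has done the heavy lifting of putting the lines in order.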

# Input must already be sorted on fields 2-4, so only the current run of
# matching lines has to be kept in memory.
# Once we reach a non-matching line we print the stored lines, prefixed by count.
# awk variables start out unset, so count and line need no explicit initialization.

# Print the stored group in input order, then reset the counter and the array.
# An indexed loop is used because "for (i in line)" visits keys in unspecified order.
function print_group(   i) {
  for (i = 0; i < count; i++) {
    print count, line[i]
  }
  count = 0
  split("", line)   # portable way to empty an array
}

{ # S2, S3, S4 are the saved field values of the previous line
  if ($2 != S2 || $3 != S3 || $4 != S4) {
    # fields 2, 3, 4 differ from the last line: the previous group is complete
    print_group()
  }
  # save this line in the array, increment count
  line[count++] = $0

  # store field values to compare with the next line read
  S2 = $2; S3 = $3; S4 = $4
}
END { # on EOF the last group is still in the array, print it
  print_group()
}

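Typically you save the awk script in a file.
You can then use it as follows. The -b option makes sort ignore the leading blanks of each key; without it, in some locales the varying padding after the ID column can separate lines whose values are equal.
sort -b -k2,4 file | awk -f script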

3 ID-fred   4.0  6.0  42.0  
3 ID-jacob  4.0  6.0  42.0  
3 ID-tessa  4.0  6.0  42.0
2 ID-elsa   5.0  8.0  45.0  
2 ID-trudy  5.0  8.0  45.0  
1 ID-gerard 6.0  8.0  20.0  
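
Note that the output above comes out in sorted order, while the question's desired result keeps the original line order. If the input is a regular file that can be read twice, a two-pass awk idiom is one way to keep that order. The sketch below is an alternative technique, not part of the answer above. The function name `count_dups_keep_order` is hypothetical, and it assumes whitespace-separated columns and that keys are compared as text (so `4.0` and `4.00` would count as different).

```shell
# count_dups_keep_order FILE
# Hypothetical helper: prefix each line of FILE with the number of lines
# that share its fields 2-4, keeping the original line order.
count_dups_keep_order() {
  awk '
    # First pass (NR == FNR holds only while reading the first copy):
    # tally how often each combination of fields 2-4 occurs.
    NR == FNR { seen[$2, $3, $4]++; next }
    # Second pass: print the tally for this combination, then the line itself.
    { print seen[$2, $3, $4], $0 }
  ' "$1" "$1"
}
```

Called as `count_dups_keep_order file > outputfile`, this should reproduce the desired result from the question on the sample data. It keeps one counter per distinct combination in memory rather than the lines themselves, but it cannot read from a pipe because the file is opened twice.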