List only duplicate lines based on one column of a semicolon-delimited file?

Question

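I have a file with a bunch of lines. Each of these lines has 8 semicolon-delimited columns.

How can I (in Linux) return the duplicate lines, but based only on column 2? Should I use grep or something else?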

Answer 1

See my comments in the awk script.

$ cat data.txt 
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

$ cat dup.awk 
BEGIN { FS = ";" }

{
    # Keep count of the fields in second column
    count[$2]++;

    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print because the field is
    # duplicated
    if (count[$2] > 1)
        print
}

Sample output:

$ sort -t ';' -k 2 data.txt | awk -f dup.awk

Alex Tremble;atrem;415
Alex Trebe;atrem;416
John Thomas;jd;301
John Tomas;jd;302
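
A side note on the script above: it keeps its counts in an associative array, so it does not seem to depend on sorted input; sorting only changes the order in which the groups come out. The following is a minimal sketch of running the same logic directly on the unsorted sample data. The temporary file and the dups variable are assumptions made for illustration, not part of the original answer.

```shell
# Sample data from the answer, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Same logic as dup.awk, written as pattern-action rules, with no sort step.
dups=$(awk -F';' '
{ count[$2]++ }                      # count every column-2 value
count[$2] == 1 { first[$2] = $0 }    # remember the first line for this value
count[$2] == 2 { print first[$2] }   # on the second hit, emit the saved line
count[$2] > 1                        # from the second hit on, print the line
' "$data")
printf '%s\n' "$dups"
rm -f "$data"
```

Without the sort, each group is printed as soon as its second member is seen, so the groups appear in the order in which they become duplicates.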

Here is my solution #2:

awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt

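The advantage of this solution is that it preserves the line order, at the cost of chaining many tools together (awk, sort, uniq, and fgrep, i.e. grep -F).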

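The awk command prints the second field, and its output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, standard output contains the list of duplicated second fields, one per line. We then pipe that list into fgrep. The '-f -' flag tells fgrep to read the strings to search for from standard input.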

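Yes, you can go all out with the command line. I like the second solution better because it exercises many tools and has clearer logic (at least to me). The drawbacks are the number of tools and possibly the memory used. The second solution is also inefficient because it scans the data file twice: once with the awk command and once with the fgrep command. This consideration matters only when the input file is large.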
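
One caveat with the grep -F pipeline: fixed strings match anywhere in a line, so a key such as jd would also select a line that happens to contain jd in another column. A possible alternative, sketched here under the assumption that reading the file twice is acceptable, is a two-pass awk that compares column 2 exactly and still preserves the original line order. The temporary file and the dups variable are illustrative only.

```shell
# Sample data from the answer, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# The same file is given twice. While NR == FNR awk is still in the first
# pass and only counts column-2 values; in the second pass it prints every
# line whose column-2 value was seen more than once.
dups=$(awk -F';' 'NR == FNR { count[$2]++; next } count[$2] > 1' "$data" "$data")
printf '%s\n' "$dups"
rm -f "$data"
```

This keeps a single tool in the pipeline, but it only works on a regular file, not on a stream, because the input has to be read twice.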

Answer 2

jtb*_*des 7

Here is a convoluted awk script.

awk 'BEGIN { FS=";" } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.txt

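It works by keeping a counter of the occurrences of each value in the second field, together with the lines that have that value, and then printing the lines whose counter is greater than 1.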

Replace every instance of $2 with whichever field number you need, and replace file.txt at the end with your file name.
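
Instead of editing every $2 by hand, the field number could be passed in as an awk variable. The sketch below assumes a variable named col (a name chosen here for illustration) and reuses the three-column sample data from Answer 1 in a temporary file.

```shell
# Sample data, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# col selects the field to check; $col is that field of the current line.
dups=$(awk -F';' -v col=2 '
{ c[$col]++; l[$col, c[$col]] = $0 }   # count the value and store the line
END {
    for (i in c)                       # group order is unspecified in awk
        if (c[i] > 1)
            for (j = 1; j <= c[i]; j++)
                print l[i, j]
}
' "$data")
printf '%s\n' "$dups"
rm -f "$data"
```

Lines within a group come out in input order, but the order of the groups themselves depends on how awk iterates over the array, so pipe the result through sort if a stable order matters.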

Answer 3

mjv*_*mjv 1

grep could do it, but I'm guessing you'll have a much easier time with awk (aka gawk on some systems).

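The most effective chain or script for your need depends on a few extra bits of information. For example, is the input file readily sorted, and how big is the input (or rather, is it huge, or a stream)...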

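Check the solutions provided by Jonathan Leffler or Hai Vu for a way to achieve the same result without the pre-sort requirement.

Assuming sorted input (either originally sorted, or piped through sort), the awk script would look something like this (note: untested):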

#!/usr/bin/awk -f
# *** Simple AWK script to output duplicate lines found in input ***
#    Assume input is sorted on fields

BEGIN {
    FS = ";";   #delimiter
    dupCtr = 0;       # number of duplicate _instances_
    dupLinesCtr = 0;  # total number of duplicate lines

    firstInSeries = 1;   #used to detect if this is first in series

    prevLine = "";
    prevCol2 = "";  # use another string in case empty field is valid
}

{
  if (NR > 1 && $2 == prevCol2) {   # NR > 1: never match the first line against the initial ""
    if (firstInSeries == 1) {
      firstInSeries = 0;
      dupCtr++;
      dupLinesCtr++;
      print prevLine
    }
    dupLinesCtr++;
    print $0
  }
  else
     firstInSeries = 1
  prevCol2 = $2
  prevLine = $0
}

END { #optional display of counts etc.
  print "*********"
  print "Total duplicate instances = " dupCtr "   Total lines = " NR;
}
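
For reference, a compact sketch of the same adjacent-line comparison, fed by a sort on column 2 only. The -k2,2 key and the -s (stable) flag are assumptions about the sort in use (GNU and BSD sort support -s, POSIX does not require it), and the sample data, temporary file, and variable names are illustrative.

```shell
# Sample data, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Sort on column 2 only; -s keeps input order among lines with equal keys.
# awk then compares each line's column 2 with the previous line's.
dups=$(LC_ALL=C sort -t';' -k2,2 -s "$data" | awk -F';' '
NR > 1 && $2 == prev {               # same key as the previous line
    if (!shown) print prevline       # emit the first line of the run once
    print
    shown = 1
}
NR == 1 || $2 != prev { shown = 0 }  # a new key starts a new run
{ prev = $2; prevline = $0 }
')
printf '%s\n' "$dups"
rm -f "$data"
```

Because only the previous line is kept in memory, this form also suits very large inputs, provided the sort itself is affordable.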

