Match the first occurrence between two files and print the lines

Question

ser*_*gio 4 command-line text-processing

I have two files that look like this:

File 1 (unique IDs):

And file 2:

    1  C95696352 score:  -69.785 nathvy =  38 nconfs =          888
    2  C98230482 score:  -57.431 nathvy =  47 nconfs =          575
    3  C96209347 score:  -57.128 nathvy =  24 nconfs =         1188
    4  C36510773 score:  -56.502 nathvy =  38 nconfs =         7595
    5  C04355288 score:  -56.400 nathvy =  41 nconfs =        50502
    6  C89372772 score:  -55.728 nathvy =  22 nconfs =         3228
    7  C96209347 score:  -54.713 nathvy =  24 nconfs =          162
    8  C96209347 score:  -53.901 nathvy =  24 nconfs =          159
    9  C06169346 score:  -53.438 nathvy =  22 nconfs =          105
   10  C95696352 score:  -52.848 nathvy =  38 nconfs =          878
   11  C98216318 score:  -52.061 nathvy =  52 nconfs =         1092
   12  C04285713 score:  -52.009 nathvy =  38 nconfs =         1355
   13  C96209347 score:  -51.477 nathvy =  24 nconfs =         1375
   14  C98222837 score:  -50.730 nathvy =  34 nconfs =          588
   15  C98216318 score:  -50.694 nathvy =  52 nconfs =         1136
   16  C32832068 score:  -50.546 nathvy =  22 nconfs =          548
   17  C95696352 score:  -50.475 nathvy =  38 nconfs =         3220
   18  C32832068 score:  -50.457 nathvy =  22 nconfs =        16235
   19  C95696352 score:  -50.234 nathvy =  38 nconfs =         3048
   20  C85594749 score:  -49.780 nathvy =  44 nconfs =         4536
   21  C72332782 score:  -49.676 nathvy =  41 nconfs =         3942
   22  C97970648 score:  -49.616 nathvy =  45 nconfs =        17640
   23  C04285713 score:  -49.594 nathvy =  38 nconfs =        14038
   24  C98043133 score:  -49.370 nathvy =  43 nconfs =         1236
   25  C89372772 score:  -49.308 nathvy =  22 nconfs =          471
   26  C97970648 score:  -49.297 nathvy =  45 nconfs =        17850
   27  C85594749 score:  -49.122 nathvy =  44 nconfs =         4158
   28  C70006381 score:  -49.092 nathvy =  24 nconfs =          880

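I want to match the IDs in file1 against the IDs in file2 (second column) and print the lines with matching IDs. In addition, some IDs in file2 are repeated, for example C96209347 (although the lines as a whole are not identical). I want to grep only the line where each ID occurs for the first time and skip the others. So in this particular example, only the third line of file2 should be printed for C96209347. Can anybody help?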

Answer 1

pLu*_*umo 9

Try this:

grep -f file1 file2 | awk '!_[$2]++'

 1  C95696352 score:  -69.785 nathvy =  38 nconfs =          888
 3  C96209347 score:  -57.128 nathvy =  24 nconfs =         1188
 6  C89372772 score:  -55.728 nathvy =  22 nconfs =         3228
20  C85594749 score:  -49.780 nathvy =  44 nconfs =         4536
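
As a hedged aside: `grep -f` treats every line of file1 as a regular expression and matches it anywhere on the line, including inside a longer token. If that is a concern, a slightly stricter variant adds `-F` (fixed strings) and `-w` (whole words). The sketch below uses made-up sample data, because the real contents of file1 are not shown in the question; the line numbered 99 is hypothetical and exists only to show a substring that is not matched.

```shell
# Hypothetical sample data; the real contents of file1 are not shown in the question.
tmp=$(mktemp -d)
printf '%s\n' C96209347 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
    3  C96209347 score:  -57.128 nathvy =  24 nconfs =         1188
    7  C96209347 score:  -54.713 nathvy =  24 nconfs =          162
   99  XC96209347Y score:  -1.000 nathvy =  24 nconfs =            1
EOF

# -F: treat each line of file1 as a literal string, not a regex
# -w: require the match to be a whole word
result=$(grep -Fwf "$tmp/file1" "$tmp/file2" | awk '!seen[$2]++')
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that `-w` still matches a whole word anywhere on the line, not only in the second column.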

Explanation

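grep -f file1 file2: searches file2 for lines that match any of the patterns read from file1 (one pattern per line)
awk '!_[$2]++': prints a line only if the value of field $2 has not been seen before; lines with a repeated ID are skipped
- _ is the name of the array (it can be anything, e.g. "seen")
- _[$2]++ creates an array entry whose key is the content of field $2 (if it does not exist yet) and adds 1 to it. Because ++ is a post-increment, the expression evaluates to the value before the increment: 0 the first time an ID appears, a positive number afterwards.
- ! negates that value, so the condition is true only when _[$2] has not been set yet, and the line is printed. Printing the line is the default action awk performs when a condition is true and no action is given.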
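
To make the idiom more concrete, here is a minimal sketch that spells out `!seen[$2]++` as an explicit `if` block. The three input lines are hypothetical and only serve to show one repeated key in the second field; `seen` is just a more descriptive name for the array.

```shell
# Hypothetical input: three lines whose second field repeats once.
result=$(printf '%s\n' '1 A x' '2 B y' '3 A z' | awk '
{
    if (seen[$2] == 0) {   # an unset entry compares equal to 0: first time this ID appears
        print              # print the whole line
    }
    seen[$2]++             # count the occurrence so later lines with this ID are skipped
}')
printf '%s\n' "$result"
```

Only the lines `1 A x` and `2 B y` are printed; the third line is dropped because `A` has already been counted.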
