我有一个类似于下面示例的文件。第一列是 SNP id。
head data
2L:647803 1 2 44.31655 -12.2373
2L:647803 1 2 43.63717 -12.302
2L:647803 1 2 43.80007 -12.3451
2L:2602906 1 2 43.39748 -11.4894
2L:2602906 1 2 44.43951 -12.3093
2L:2602906 1 2 43.80007 -12.3451
2L:3146785 1 2 44.31655 -12.2373
2L:3146785 1 2 44.43951 -12.3093
2L:3146785 1 2 43.80007 -12.3451
2L:3771395 1 2 43.39748 -11.4894
2L:3771395 1 2 43.2661 -11.6803
2L:3945568 1 2 43.63717 -12.302
2L:3945568 1 2 43.39032 -11.6099
Run Code Online (Sandbox Code Playgroud)
对于每个 SNP ( 2L:647803
, 2L:2602906
, 2L:3146785
, ...),我想要 3 行。如果每个 SNP 没有 3 行,我想删除该 SNP。这是我想要的输出:(2L:3771395
并且 2L:3945568
被删除,因为每个输出只有两个实例)。
head desired
2L:647803 1 2 44.31655 -12.2373
2L:647803 1 2 43.63717 -12.302
2L:647803 1 2 43.80007 -12.3451
2L:2602906 1 2 43.39748 -11.4894
2L:2602906 1 2 44.43951 -12.3093
2L:2602906 1 2 43.80007 -12.3451
2L:3146785 1 2 44.31655 -12.2373
2L:3146785 1 2 44.43951 -12.3093
2L:3146785 1 2 43.80007 -12.3451
Run Code Online (Sandbox Code Playgroud)
不优雅但实用:
$ awk 'NR==FNR {a[$1]++; next} a[$1]==3' data data
2L:647803 1 2 44.31655 -12.2373
2L:647803 1 2 43.63717 -12.302
2L:647803 1 2 43.80007 -12.3451
2L:2602906 1 2 43.39748 -11.4894
2L:2602906 1 2 44.43951 -12.3093
2L:2602906 1 2 43.80007 -12.3451
2L:3146785 1 2 44.31655 -12.2373
2L:3146785 1 2 44.43951 -12.3093
2L:3146785 1 2 43.80007 -12.3451
Run Code Online (Sandbox Code Playgroud)