Faster solution to compare files in bash

Question

file1:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198

and file2:

NR_024540 11

I need to find the matches for file2 in file1 and print the whole file1 line plus the second column of file2.

So the output is:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11

My solution in bash is very slow:

#!/bin/bash

while read line; do
    c=$(echo $line | awk '{print $1}')
    d=$(echo $line | awk '{print $2}')

    grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2

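I would prefer any faster bash or awk solution. The output format can be modified, but it needs to keep all the information (the order of the columns can differ).

Edit:

For now it looks like the fastest solution is the one from @chepner: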

#!/bin/bash

while read -r c d; do
    grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output

Answer 1

Ini*_*ian 5

In a single Awk command:

awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

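Assigning to $(NF+1) makes awk rebuild the record using OFS, which is a single space by default. That is why the output above is space-separated. If the original spacing of file1 should survive, one option is to print the untouched line and append the value. A minimal sketch, assuming file1 is tab-separated and a tab is wanted before the new column; the temporary directory and the sample files are only there to make it runnable:

```shell
#!/bin/bash
# Sketch: same regex scan as the one-liner, but $0 is never modified,
# so the original delimiters of file1 are kept.
# ASSUMPTION: file1 is tab-separated and a tab should precede the new column.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data taken from the question.
printf 'chr1\t14361\t14829\tNR_024540_0_r_DDX11L1,WASH7P_468\n' > file1
printf 'NR_024540 11\n' > file2

result=$(awk 'FNR==NR { map[$1]=$2; next }
              { for (i in map) if ($0 ~ i) { print $0 "\t" map[i]; next } }' file2 file1)

printf '%s\n' "$result"
```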

A more readable multi-line version:

FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array map
    for (i in map){
        # If the key from the hash-map matches the line as a regex,
        # append the mapped value as a new last field, print the
        # line, and move on to the next line
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}

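The loop over map tests every key against every line of file1, so the work grows with the product of the two file sizes. If the key can be derived from the line itself, a direct array lookup avoids that loop. A minimal sketch, assuming every key in file2 equals the first two underscore-separated pieces of column 4 in file1 (true for the sample data, not confirmed for the real files); the temporary directory and the sample files are only there to make it runnable:

```shell
#!/bin/bash
# Sketch: join file1 to file2 through a derived key instead of a regex scan.
# ASSUMPTION: the file2 key (e.g. NR_024540) is exactly the first two
# underscore-separated pieces of column 4 in file1.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data taken from the question (tab-separated here by assumption).
printf 'chr1\t14361\t14829\tNR_024540_0_r_DDX11L1,WASH7P_468\n' >  file1
printf 'chr1\t14969\t15038\tNR_024540_1_r_WASH7P_69\n'          >> file1
printf 'NR_024540 11\n' > file2

result=$(awk '
FNR == NR {                  # first file (file2): remember key -> value
    map[$1] = $2
    next
}
{                            # second file (file1)
    split($4, part, "_")     # NR_024540_0_r_... -> part[1]="NR", part[2]="024540"
    key = part[1] "_" part[2]
    if (key in map)          # one hash lookup per line, no loop over map
        print $0 "\t" map[key]
}' file2 file1)

printf '%s\n' "$result"
```

Lines whose derived key is missing from file2 are skipped, which matches the behaviour of the regex version for non-matching lines.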
