Ger*_*oge 2 linux bash awk sed
file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find the lines in file1 that match an ID from file2, and print the whole file1 line plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My solution in bash is very slow:
#!/bin/bash
while read line; do
c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
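I would prefer any faster bash or awk solution. The output can be modified, but it needs to keep all the information (the order of the columns can differ).
Edit:
So far, the fastest solution looks like the one from @chepner: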
#!/bin/bash
while read -r c d; do
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output
In a single Awk command,
awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
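One thing to watch with `$0 ~ i`: the ID is treated as a regular expression and may match anywhere on the line, so a shorter ID that happens to be a prefix of a longer one would match too. Below is a minimal sketch of a literal variant using `index()`. It assumes the ID always sits at the start of field 4 and is followed by an underscore. The second file2 entry (`NR_02454 99`) is hypothetical, added only to show the difference.

```shell
#!/bin/bash
# Sample data written to a scratch directory; the first file2 entry is from
# the question, the second (NR_02454 99) is a hypothetical near-duplicate ID.
workdir=$(mktemp -d)
printf '%s\n' \
  'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468' \
  'chr1 14969 15038 NR_024540_1_r_WASH7P_69' > "$workdir/file1"
printf '%s\n' 'NR_024540 11' 'NR_02454 99' > "$workdir/file2"

# index($4, k "_") == 1 holds only when field 4 starts with the literal ID
# followed by "_", so nothing in the ID is interpreted as a regex.
literal_out=$(awk '
  FNR == NR { map[$1] = $2; next }
  {
    for (k in map)
      if (index($4, k "_") == 1) { print $0, map[k]; next }
  }' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$literal_out"
rm -rf "$workdir"
```

With the regex test, `NR_02454` would also match both lines, and which value gets appended would depend on awk's unspecified `for (k in map)` order. With the literal prefix test, only `NR_024540` matches.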
A more readable multi-line version:
FNR==NR {
    # Map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array 'map'
    for (i in map){
        # If there is a direct regex match on the line with the
        # key from the hash-map, append the hash-mapped value as
        # a new last field and print the line
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}
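The loop above tests every ID from file2 against every line of file1, which can get slow when file2 is large. If the ID can be rebuilt from the line itself, a single hash lookup per line is enough. The sketch below assumes (this is not stated in the question) that every ID has the form `PREFIX_NUMBER`, as in `NR_024540`, and that it makes up the first two underscore-separated pieces of field 4.

```shell
#!/bin/bash
# Sample data from the question, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468' \
  'chr1 14969 15038 NR_024540_1_r_WASH7P_69' \
  'chr1 15795 15947 NR_024540_2_r_WASH7P_152' \
  'chr1 16606 16765 NR_024540_3_r_WASH7P_15' \
  'chr1 16857 17055 NR_024540_4_r_WASH7P_198' > "$workdir/file1"
printf '%s\n' 'NR_024540 11' > "$workdir/file2"

lookup_out=$(awk '
  # First file (file2): remember the value for each ID.
  FNR == NR { map[$1] = $2; next }
  # Second file (file1): rebuild the ID from the first two "_" pieces of
  # field 4 (assumed layout) and look it up directly, without a loop.
  {
    split($4, part, "_")
    key = part[1] "_" part[2]
    if (key in map) print $0, map[key]
  }' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$lookup_out"
rm -rf "$workdir"
```

For the sample data this prints the five file1 lines, each followed by `11`. Because `$0` is never modified, the original field separators of file1 are kept as they are. Lines whose ID is not in file2 are skipped, as in the loop version.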