I'm having trouble with grep and awk. I think it's because my input file contains text that looks like code.
The input file contains ID names like these:
SNORD115-40
MIR432
RNU6-2
The reference file looks like this:
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
I want to match the ID names in the source file against my reference file and print the corresponding ENSG ID numbers, so that the output file looks like this:
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
I tried this loop:
exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done
I also tried working on the reference file with awk:
awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file
But I only get one of the grep'd IDs.
Any suggestions or an easier way of doing this would be great.
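A note on the loop above: one likely reason it leaves a single ID is that `> outputfile` sits inside the loop, so each iteration truncates the file and only the last match survives. A minimal sketch of a fix, assuming the same file names as in the question (the sample data is recreated here only to keep the snippet self-contained), moves the redirection to after `done`:

```shell
# Recreate the sample files from the question (illustrative data only).
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
cat > reference.file <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

# Redirect once, after the loop, so every match is kept.
# -F treats each ID as a fixed string; "--" guards against IDs starting with "-".
while IFS= read -r line
do
    grep -Fw -- "$line" reference.file
done < source.file > outputfile
```

Quoting `"$line"` also keeps an ID from being split or glob-expanded by the shell.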
$ fgrep -f source.file reference.file
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
fgrep is equivalent to grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
The -f option takes PATTERN from a file:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
As noted in the comments, this can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build a more precise pattern for grep on the fly with sed:
grep -f <(sed 's/.*/ &$/' source.file) reference.file
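But this way the patterns are interpreted as regular expressions rather than fixed strings, which is potentially fragile (although it is probably fine if the IDs contain only alphanumeric characters). A better way (thanks to @sidharthcnadhan) is to use the -w option: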
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
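To illustrate the difference -w makes, here is a small sketch on made-up data. The file names `ids.txt` and `ref.txt`, the symbol `MIR4321`, and the ID `ENSG00000000001` are hypothetical, chosen only to provoke a substring match:

```shell
# Hypothetical data: MIR4321 contains the wanted ID MIR432 as a substring.
printf '%s\n' 'MIR432' > ids.txt
printf '%s\n' 'ENSG00000207793 MIR432' 'ENSG00000000001 MIR4321' > ref.txt

# Fixed-string matching alone reports both lines (a false positive).
grep -Ff ids.txt ref.txt > loose.txt

# Adding -w keeps only the line where MIR432 stands as a whole word.
grep -Fwf ids.txt ref.txt > strict.txt
```

Here `loose.txt` ends up with both lines, while `strict.txt` holds only the `MIR432` line.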
So the final answer to your question is:
grep -Fwf source.file reference.file
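One caveat with -w: a hyphen is not a word character, so an ID such as `SNORD115` would still match a line containing `SNORD115-40`. If that matters for your data, comparing the symbol column exactly is one alternative. The following is a sketch using awk instead of grep, assuming the symbol is always the second whitespace-separated field; the file names `wanted.txt` and `genes.txt` and the ID `SNORD115` are hypothetical:

```shell
# Hypothetical ID list: SNORD115 is a prefix of SNORD115-40, not a wanted gene.
printf '%s\n' 'SNORD115' 'MIR432' > wanted.txt
cat > genes.txt <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000207447 RNU6-2
EOF

# First file: remember every ID. Second file: print only the lines
# whose second field is exactly one of those IDs.
awk 'NR == FNR { ids[$1]; next } $2 in ids' wanted.txt genes.txt > exact.txt

# For comparison: -w still accepts SNORD115 inside SNORD115-40.
grep -Fwf wanted.txt genes.txt > word.txt
```

With this data `exact.txt` contains only the `MIR432` line, whereas `word.txt` also picks up `SNORD115-40`. The awk version assumes `wanted.txt` has no blank lines and is not empty.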