Matching text with grep or awk

use*_*573 4 awk grep

I'm having trouble with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names, like this:

SNORD115-40
MIR432
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2

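I want to match the ID names in the source file against my reference file and print the corresponding ENSG ID numbers, so that the output file looks like this: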

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

I tried this loop:

exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done

I also tried playing around with the reference file using awk:

awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file

But I only get one of the IDs grep'd.

Any suggestions, or an easier way of doing this, would be great.

Lev*_*sky 8

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

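fgrep is equivalent to grep -F (recent GNU grep releases warn that fgrep is obsolescent, so grep -F is the safer spelling):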

   -F, --fixed-strings
          Interpret  PATTERN  as  a  list  of  fixed strings, separated by
          newlines, any of which is to be matched.  (-F  is  specified  by
          POSIX.)

The -f option is used to obtain PATTERN from a file:

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
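For comparison, the loop from the question can be made to work as well. It leaves only one ID behind because `> outputfile` sits inside the loop, so every iteration truncates the file and only the last match survives. Below is a minimal sketch, assuming the sample data from the question recreated in a scratch directory (the `workdir` variable is only for illustration); it redirects the whole loop once and quotes the pattern:

```shell
# Recreate the question's sample files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000266661' \
  'ENSG00000243133' \
  'ENSG00000207447 RNU6-2' > reference.file

# Read one ID per line. The output redirection is attached to the loop
# as a whole, so the file is opened once instead of once per iteration.
while IFS= read -r line; do
  # -F: fixed string, -w: whole word, --: end of options.
  grep -Fw -- "$line" reference.file
done < source.file > outputfile

cat outputfile
```

This runs grep once per ID, so for long ID lists the single `grep -f` call is likely to be noticeably faster.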

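As noted in the comments, fgrep -f can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build a more explicit pattern for grep on the fly with sed: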

grep -f <(sed 's/.*/ &$/' source.file) reference.file

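But this way the patterns are interpreted as regular expressions rather than fixed strings, which is potentially fragile (although it is probably fine if the IDs contain only alphanumeric characters). A better way, though (thanks to @sidharthcnadhan), is to use the -w option: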

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.
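To see what -w changes in practice, here is a small sketch. The two extra reference lines (`MIR4321` and `MIR432-2`, with made-up ENSG numbers) are hypothetical and exist only to trigger the edge cases. Note that because "-" is not a word-constituent character, -w still accepts an ID that is followed by a hyphen.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'MIR432' > source.file
# The second and third lines are hypothetical entries, not real annotations.
printf '%s\n' \
  'ENSG00000207793 MIR432' \
  'ENSG00000000001 MIR4321' \
  'ENSG00000000002 MIR432-2' > reference.file

# Fixed-string matching alone: any substring hit counts,
# so all three reference lines are selected.
grep -Ff source.file reference.file > plain.out

# Whole-word matching drops MIR4321 (followed by a digit) but keeps
# MIR432-2, because the hyphen counts as a word boundary.
grep -Fwf source.file reference.file > word.out

echo "without -w: $(grep -c '' plain.out) lines"
echo "with -w:    $(grep -c '' word.out) lines"
```

So -w removes the plain substring false positives, but hyphenated IDs that share a prefix may still need a stricter comparison.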

So the final answer to your question is:

grep -Fwf source.file reference.file
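Since the question also asks about awk, an exact comparison on the symbol column is another option, and it sidesteps the word-boundary behaviour of -w for hyphenated IDs. This is only a sketch, assuming the reference file is whitespace-separated with the symbol in the second field. The `MIR432-2` line is a hypothetical entry added to show that it is not selected.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000000002 MIR432-2' \
  'ENSG00000266661' \
  'ENSG00000207447 RNU6-2' > reference.file

# First file (NR == FNR): remember each ID as an array key.
# Second file: print lines with exactly two fields whose second
# field is one of the remembered IDs (string equality, no regex).
awk 'NR == FNR { ids[$1]; next } NF == 2 && ($2 in ids)' \
  source.file reference.file > awk.out

cat awk.out
```

The output follows the order of reference.file rather than source.file, and the header and the ID-only lines are skipped because they do not have exactly two fields.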

  • We can use "fgrep -wf source.file reference.file" to avoid false positives. (2 upvotes)