Extract positions 2-7 from each gene's FASTA sequence using Bash

sia*_*ian 2 awk command-line fasta

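I have a file containing a subset of gene IDs, and a FASTA file containing all gene IDs and their sequences. For each gene in the subset file, I want to get positions 2-7 from the start of its FASTA sequence. Ideally, the output file would be 'pos 2-7' '\t' 'geneID'.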

Example subset:

mmu-let-7g-5p MIMAT0000121  
mmu-let-7i-5p MIMAT0000122 

FASTA file:

>mmu-let-7g-5p MIMAT0000121 
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122 
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525 
UGAGGUAGUAGAUUGUAUAGUU

Desired output:

GAGGUA   mmu-let-7g-5p MIMAT0000121
GAGGUA   mmu-let-7i-5p MIMAT0000122

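I have already done the first part (extracting the FASTA sequences for the gene subset) using grep -w -A 1 -f. I don't know how to get positions 2-7 and make the output look like the above using Bash.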

Rav*_*h13 5

Could you please try the following? It was written and tested only with the samples shown, in GNU awk.

awk '
FNR==NR{
  a[$1]=$2
  next
}
/^>/{
  ind=substr($1,2)
}
/^>/ && (ind in a){
  found=1
  val=ind OFS a[ind]
  next
}
found{
  print substr($0,2,6) OFS val
  val=found=""
}
' gene fastafile

Explanation: a detailed, line-by-line explanation of the above.

awk '                               ## Start the awk program.
FNR==NR{                            ## True only while the first file (gene) is being read.
  a[$1]=$2                          ## Store field 2 in array a, keyed by field 1.
  next                              ## Skip the remaining rules for this line.
}
/^>/{                               ## For every line that starts with ">":
  ind=substr($1,2)                  ## set ind to field 1 without its leading ">".
}
/^>/ && (ind in a){                 ## If the line starts with ">" and ind is a key in array a:
  found=1                           ## set the found flag,
  val=ind OFS a[ind]                ## build val as ind, OFS, and the value stored in a[ind],
  next                              ## and skip the remaining rules for this line.
}
found{                              ## If found is set (the previous header matched):
  print substr($0,2,6) OFS val      ## print characters 2 through 7 of the line, then OFS and val,
  val=found=""                      ## and clear val and found.
}
' gene fastafile                    ## Input files: the gene subset first, then the FASTA file.
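The question asks for a tab between the extracted positions and the gene ID, while the script above joins everything with the default OFS (a single space). The following is a minimal sketch of one way to get the tab, not the answerer's exact script: it assumes the sample data from the question, writes it to hypothetical files in a temporary directory, passes -v OFS='\t', and rebuilds the ID with a literal space. As a small variation, it also recomputes the found flag on every header line.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and file names, used only for this demo.
tmpdir=$(mktemp -d)
printf '%s\n' 'mmu-let-7g-5p MIMAT0000121' 'mmu-let-7i-5p MIMAT0000122' > "$tmpdir/gene"
printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122' 'UGAGGUAGUAGUUUGUGCUGUU' \
  '>mmu-let-7f-5p MIMAT0000525' 'UGAGGUAGUAGAUUGUAUAGUU' > "$tmpdir/fastafile"

result=$(awk -v OFS='\t' '
FNR==NR { a[$1]=$2; next }            # first file: map gene name -> accession
/^>/ {                                # header line in the FASTA file
  ind = substr($1, 2)                 # gene name without the leading ">"
  found = (ind in a)                  # 1 if the gene is in the subset, else 0
  if (found) val = ind " " a[ind]     # rebuild the ID with a single space
  next
}
found {                               # first sequence line after a matching header
  print substr($0, 2, 6), val         # positions 2-7, OFS (tab), ID
  found = 0
}
' "$tmpdir/gene" "$tmpdir/fastafile")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the sample data this should print one line per subset gene, GAGGUA followed by a tab and the ID, and nothing for mmu-let-7f-5p.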


Sun*_*eep 5

Tested with GNU awk, but I think it will work with any awk.

$ awk 'NR==FNR{a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA  mmu-let-7g-5p MIMAT0000121
GAGGUA  mmu-let-7i-5p MIMAT0000122
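  • NR==FNR{a[$0]; next} builds an array whose keys are the full lines of the first file passed to awk
  • RS='>' FS='\n' OFS='\t' sets the input record separator to >, the input field separator to newline, and the output field separator to tab, for the second file only (because these variables are assigned after the first filename)
  • $1 in a{print substr($2, 2, 6), $1} prints the required details if the first field exists as a key in array a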
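The effect of placing variable assignments between file names can be checked with a small experiment. This is only an illustrative sketch: the files first.txt, second.txt, and demo.fa and their contents are made up for the demo and do not come from the thread.

```shell
#!/usr/bin/env bash
# Hypothetical scratch files, created only for this experiment.
tmpdir=$(mktemp -d)
printf 'a b c\n' > "$tmpdir/first.txt"
printf 'a b c\n' > "$tmpdir/second.txt"

# FS=":" takes effect only when awk reaches it in the argument list, so
# first.txt is split on whitespace and second.txt on ":".
counts=$(awk '{ print NF }' "$tmpdir/first.txt" FS=':' "$tmpdir/second.txt")
printf '%s\n' "$counts"

# With RS=">" each FASTA entry becomes one record, and with FS="\n" the
# header is $1 and the sequence is $2. The text before the first ">" forms
# an empty record, which is skipped here.
printf '>id1 X\nACGUACGU\n>id2 Y\nUUUUGGGG\n' > "$tmpdir/demo.fa"
parsed=$(awk '$1 != "" { print "header=[" $1 "] seq=[" $2 "]" }' RS='>' FS='\n' "$tmpdir/demo.fa")
printf '%s\n' "$parsed"
rm -rf "$tmpdir"
```

The first command should report 3 fields for first.txt and 1 for second.txt, even though the two files are identical. The empty record before the first > is also why the main command needs no special case for it: an empty $1 is not a key in the array.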

If the lines can have trailing whitespace characters, use:

$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt
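To see the trailing-whitespace handling in action, here is a hedged, self-contained sketch. It assumes the sample data from the question with trailing spaces kept on the ID lines, and writes it to hypothetical temporary files. It also swaps [[:space:]] for the explicit bracket list [ \t] (space and tab only), an assumption made for awk builds that lack POSIX character classes. It is a variant of the command above, not a replacement for it.

```shell
#!/usr/bin/env bash
# Hypothetical scratch files; note the trailing spaces on the ID lines.
tmpdir=$(mktemp -d)
printf '%s\n' 'mmu-let-7g-5p MIMAT0000121  ' 'mmu-let-7i-5p MIMAT0000122 ' > "$tmpdir/gene.txt"
printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121 ' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122 ' 'UGAGGUAGUAGUUUGUGCUGUU' \
  '>mmu-let-7f-5p MIMAT0000525 ' 'UGAGGUAGUAGAUUGUAUAGUU' > "$tmpdir/fasta.txt"

# First file: strip trailing blanks, then use the whole line as a key.
# Second file: records split on ">", fields split on optional blanks plus
# a newline, so the header in $1 carries no trailing blanks either.
out=$(awk 'NR==FNR { sub(/[ \t]+$/, ""); a[$0]; next }
           $1 in a { print substr($2, 2, 6), $1 }
          ' "$tmpdir/gene.txt" RS='>' FS='[ \t]*\n' OFS='\t' "$tmpdir/fasta.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Without the sub() call and the widened FS, the keys from gene.txt and the headers from fasta.txt would differ in their trailing blanks (two spaces versus one for the first gene), and that lookup would fail.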