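I have a dataset of 20,000 probes arranged in two columns, 21 nt each. From this file I need to extract the rows where the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried the AWK substr function, but I did not get the expected result. This is the one-liner I tried: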
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
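Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/), but I cannot find a way to use a regular expression to match the probes across the two columns. All suggestions and comments are much appreciated.
Example of dataset: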
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
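A minimal sketch of one way to do this, assuming the columns are whitespace-separated and the two sequences sit in fields 2 and 4, as in the sample. The substr comparison in the one-liner above looks right for the sample data, so one guess (not confirmed by the question) is that the real file has Windows line endings or sequences that are not exactly 21 characters long. This variant compares the last character whatever the length and drops a trailing carriage return first. The function name `same_last_nt` is hypothetical.

```shell
# Print rows whose sequences in fields 2 and 4 end in the same character.
# Assumption: whitespace-separated columns, sequences in fields 2 and 4.
same_last_nt() {
  awk '
    { sub(/\r$/, "") }   # strip a trailing carriage return (CRLF files), if present
    # require at least 4 fields so the header and blank lines are skipped
    NF >= 4 && substr($2, length($2)) == substr($4, length($4))
  ' "$@"
}
```

It reads from the files given as arguments or from standard input, for example `same_last_nt probes.txt`, where `probes.txt` stands in for the real file name.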
I have a 10,000-line file that looks like this:
Peptidyl-prolyl cis-trans isomerase A OS=Homo sapiens GN=PPIA PE=1 SV=2 - [PPIA] 0.8622399654 3.2730004556
I cannot figure out how to remove the part of the string that comes before the square bracket, so that the final output looks like this:
[PPIA] 0.8622399654 3.2730004556
So far I have tried Python's re.sub, but I could not get it to match from the start of the line.
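One possible approach, sketched with sed rather than Python: anchor at the start of the line, consume every character that is not an opening bracket, and put the bracket back. It assumes the first `[` on each line is the one to keep; lines with no `[` are left unchanged. The function name `strip_to_bracket` is hypothetical.

```shell
# Delete everything from the start of each line up to the first "[".
# Assumption: the first "[" on the line marks the part to keep.
strip_to_bracket() {
  # ^[^[]*  matches the leading run of non-"[" characters
  # \[      matches the first literal "[", which the replacement restores
  sed 's/^[^[]*\[/[/' "$@"
}
```

The same pattern should carry over to Python as `re.sub(r'^[^\[]*', '', line)`, applied one line at a time, though unlike the sed version that call also empties lines that contain no bracket.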
I have a list of 20,000 probes. Is there a way to use sed/awk to extract the first three lines/occurrences of each probe?
Example of dataset:
Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe1 D GTTGGCGAAGTCACATCTAG
Probe1 E CATGTCGCCGACTCCGTCGA
Probe1 F GTGATGTTCTGAGTACATAG
Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
Probe3 Y GGAGATGTAGGCCTTAAAAA
Probe3 D GATTGTAGGGGTCCTGCCAG
Desired output:
Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
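A minimal awk sketch for this, assuming the probe name is the first whitespace-separated field. It keeps a per-probe counter and prints a line only while that counter is at most 3, so the rows for a given probe do not need to be adjacent. The function name `first_three_per_probe` is hypothetical.

```shell
# Print only the first three lines seen for each distinct value of field 1.
# Assumption: the probe name is the first whitespace-separated field.
first_three_per_probe() {
  # count[$1] is incremented for every line; the line prints while the count is <= 3
  awk '++count[$1] <= 3' "$@"
}
```

Changing the 3 in the condition adjusts how many occurrences per probe are kept.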