How to keep only lines starting with a character and the line that follows

Question


jas*_*ubs 3 grep sed awk text-processing

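I have a FASTA file that intentionally contains some sequences with bad headers (i.e. without >) and some with good headers. The file is otherwise well formatted: each nucleotide sequence sits on a single line.

Example: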

2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC


I want to keep only the sequences with a correct header (i.e. a line starting with >), like this:

>hCoV-19/2832832
ACTCGGGGGG
>hCoV-19/2789877
ACTCGGCCC


I have tried various approaches (sed, grep, awk), but none gave the correct result:

awk '/^>/ { ok=index($0,"hCoV")!=0;} {if(ok) print;}' combined_v4.fa > combined_v5.fa

sed -n '/^>.*hCoV/,/^>/ {/^>.*hCoV/p ; /^>/! p}' combined_v4.fa > combined_v5.fa

grep -w ">" -A 1 combined_v4.fa > combined_v5.fa


Do you know how to do this?

Answer 1

Pan*_*nki 8

Tell grep to look for lines starting with > and to also include the line after each match:

grep -A1 --no-group-separator '^>' combined_v4.fa > combined_v5.fa
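
To see what --no-group-separator suppresses, here is a minimal sketch that runs plain grep -A1 on the sample from the question. The temporary file and the variable names (sample, with_separator) are made up for illustration; only the grep invocation mirrors the answer.

```shell
# Recreate the sample input from the question in a temporary file
# (hypothetical path; the question's real file is combined_v4.fa).
sample=$(mktemp)
cat > "$sample" <<'EOF'
2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC
EOF

# -A1 prints each matching line plus one line of trailing context.
# Without --no-group-separator, grep puts a separator line between
# groups of output that are not adjacent in the input.
with_separator=$(grep -A1 '^>' "$sample")
printf '%s\n' "$with_separator"

rm -f "$sample"
```

With GNU grep, the two good records should come out separated by a line containing only --, because the bad record sits between them in the input. That stray line is what the option (or the fallback filter) removes.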


If your version of grep does not support --no-group-separator, try the following:

grep -A1 '^>' combined_v4.fa | grep -v '^--$' > combined_v5.fa
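
If relying on grep's context options is not desirable, the same "header plus next line" filter can be sketched in awk. This is an alternative that is not part of the answer above; the flag variable keep and the temporary file are hypothetical names chosen for the example, and it assumes, as the question states, that each sequence occupies exactly one line.

```shell
# Recreate the sample input from the question in a temporary file
# (hypothetical path; the question's real file is combined_v4.fa).
sample=$(mktemp)
cat > "$sample" <<'EOF'
2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC
EOF

# On a header line: print it and remember that the next line belongs to it.
# On any other line: print it only if the previous line was a good header,
# then clear the flag so later lines are dropped until the next header.
filtered=$(awk '/^>/ { print; keep = 1; next } keep { print; keep = 0 }' "$sample")
printf '%s\n' "$filtered"

rm -f "$sample"
```

Because awk decides line by line, no separator is ever produced, so there is nothing to filter out afterwards. To write the result to a file as in the question, redirect the awk command's output to combined_v5.fa instead of capturing it in a variable.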


Unfortunately, `--no-group-separator` does not appear in the `man` page. It is, however, documented in `info grep invoking command-line context`. (4 upvotes)
