I'm struggling to do case-sensitive replacements in a text file. Below is part of the sed file that I run as sed -f file.sed < input.txt > output.txt
s/\<code_229633_13\>/R77_08349T0/
s/\<code_229633_138\>/R77_09738T0/
s/\<code_230519_10\>/R77_04813T0/
s/\<code_230519_1\>/R77_13591T0/
s/\<code_230519_13\>/R77_05463T0/
up to line 14521....
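The code works well, but I also have cases where two or more source IDs (code_010512_23 and code_299097_0) map to the same replacement ID (R77_14520T0). For those I would like the output to be something like R77_14520T0.a and R77_14520T0.b (as in lines 1 and 2 below):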
s/code_010512_23/R77_14520T0/ --> R77_14520T0.a
s/code_299097_0/R77_14520T0/ --> R77_14520T0.b
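The suffixing rule described above can be sketched independently of sed. The following is a minimal Python sketch, not the asker's actual setup: it assumes the source/replacement pairs are already available as a list of tuples, and the function name `assign_suffixes` is hypothetical. It also assumes no replacement ID is shared by more than 26 source IDs, since it only uses the letters a to z.

```python
from collections import Counter
from string import ascii_lowercase


def assign_suffixes(pairs):
    """Map each source ID to its replacement ID, appending .a, .b, ...
    when several source IDs share the same replacement ID."""
    totals = Counter(target for _, target in pairs)  # uses per replacement ID
    used = Counter()                                 # letters handed out so far
    result = {}
    for source, target in pairs:
        if totals[target] > 1:
            # Shared replacement ID: take the next unused letter for it.
            result[source] = f"{target}.{ascii_lowercase[used[target]]}"
            used[target] += 1
        else:
            result[source] = target
    return result


pairs = [
    ("code_229633_13", "R77_08349T0"),
    ("code_010512_23", "R77_14520T0"),
    ("code_299097_0", "R77_14520T0"),
]
mapping = assign_suffixes(pairs)
for source, new_name in mapping.items():
    print(source, "->", new_name)
```

Letters are handed out in the order the pairs appear, so the first source ID that uses a shared replacement gets .a.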
In addition, a more complex but similar case arises when I have the following input file (input2.txt):
ID=gene09464;Name=code_229633_13;isoforms=1
ID=mRNA10661;Parent=gene09464;Name=code_229633_13
ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0 1 1093 +
ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0 1094 1873 +

ID=gene09491;Name=code_229633_138;isoforms=1
ID=mRNA10690;Parent=gene09491;Name=code_229633_138
ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0 1 411 +

ID=gene09513;Name=code_230519_10;isoforms=1
ID=mRNA10715;Parent=gene09513;Name=code_230519_10
ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 1 59 +
ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 60 186 +

ID=gene09511;Name=code_230519_1;isoforms=1
ID=mRNA10713;Parent=gene09511;Name=code_230519_1
ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0 1 1075 +
ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0 1076 1128 +

ID=gene09514;Name=code_230519_13;isoforms=1
ID=mRNA10716;Parent=gene09514;Name=code_230519_13
ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0 1 219 +

ID=gene00865;Name=code_010512_23;isoforms=1
ID=mRNA00979;Parent=gene00865;Name=code_010512_23
ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0 1 143 +

ID=gene14561;Name=code_299097_0;isoforms=2
ID=mRNA16419;Parent=gene14561;Name=code_299097_0
ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0 144 193 +
ID=mRNA16420;Parent=gene14561;Name=code_299097_0
ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0 408 457 +
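I need to apply the same substitutions as before, but only on lines containing the word "isoforms", in other words on lines 1, 6, 10, 15, 20, 24 and 28, and nowhere else in the text. The input file will always have exactly this format, with a blank line separating the blocks that start with an "isoforms" line.
My desired output: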
ID=gene09464;Name=R77_08349T0;isoforms=1
ID=mRNA10661;Parent=gene09464;Name=code_229633_13
ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0 1 1093 +
ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0 1094 1873 +
ID=exon26194;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0 1874 4065 +

ID=gene09491;Name=R77_09738T0;isoforms=1
ID=mRNA10690;Parent=gene09491;Name=code_229633_138
ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0 1 411 +

ID=gene09513;Name=R77_04813T0;isoforms=1
ID=mRNA10715;Parent=gene09513;Name=code_230519_10
ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 1 59 +
ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 60 186 +
ID=exon26313;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 187 678 +
ID=exon26314;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 679 1399 +
ID=exon26315;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0 1400 1402 +

ID=gene09511;Name=R77_13591T0;isoforms=1
ID=mRNA10713;Parent=gene09511;Name=code_230519_1
ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0 1 1075 +
ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0 1076 1128 +

ID=gene09514;Name=R77_05463T0;isoforms=1
ID=mRNA10716;Parent=gene09514;Name=code_230519_13
ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0 1 219 +

ID=gene00865;Name=R77_14520T0.a;isoforms=1
ID=mRNA00979;Parent=gene00865;Name=code_010512_23
ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0 1 143 +

ID=gene14561;Name=R77_14520T0.b;isoforms=2
ID=mRNA16419;Parent=gene14561;Name=code_299097_0
ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0 144 193 +
ID=mRNA16420;Parent=gene14561;Name=code_299097_0
ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0 408 457 +
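The restriction described above (substitute only on lines that contain "isoforms") can be illustrated on its own. This is a minimal Python sketch under stated assumptions: the ID-to-replacement table is held in a dict, and the function name `rename_isoform_lines` is hypothetical. The `\b` word boundaries play the same role as `\<` and `\>` in the sed file, so code_230519_1 does not match inside code_230519_13.

```python
import re


def rename_isoform_lines(text, mapping):
    """Replace whole-word IDs from `mapping`, but only on lines that
    contain the word "isoforms"; all other lines pass through unchanged."""
    if not mapping:
        return text
    # One alternation over all IDs, anchored on word boundaries.
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, mapping)) + r")\b"
    )
    out = []
    for line in text.splitlines(keepends=True):
        if "isoforms" in line:
            line = pattern.sub(lambda m: mapping[m.group(0)], line)
        out.append(line)
    return "".join(out)


sample = (
    "ID=gene09511;Name=code_230519_1;isoforms=1\n"
    "ID=mRNA10713;Parent=gene09511;Name=code_230519_1\n"
    "\n"
    "ID=gene09514;Name=code_230519_13;isoforms=1\n"
)
mapping = {
    "code_230519_1": "R77_13591T0",
    "code_230519_13": "R77_05463T0",
}
result = rename_isoform_lines(sample, mapping)
print(result, end="")
```

A single compiled alternation avoids running thousands of separate substitutions per line, which matters when the table has as many entries as the sed file in the question.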
This is not really a job for sed, which is just a text stream editor. Try this Perl script:
#!/usr/bin/env perl
use strict;
use warnings;

## Set the record separator to \n\n to
## read multiple lines as a single record
$/ = "\n\n";
## This array will contain all records of the file
my @lines = <>;
## The list of suffixes
my @suffix = ('a'..'z');
## Target counts: whole file (%seen) and so far while printing (%newseen)
my (%seen, %newseen);
## For each record of the input file
foreach (@lines) {
    ## If the current "line" (lines are now multiline records because
    ## we set $/ to consecutive newlines) is one we are interested in
    if (/isoforms.*?Target=(\S+)/s) {
        ## Count how many records use this target
        $seen{$1}++;
    }
}
## Now that we have processed the entire file,
## go back and print each record.
foreach (@lines) {
    ## If this record is one of the ones we're interested in
    if (/Name=(.+?);.*?isoforms=.*?Target=(\S+)/s) {
        my ($name, $target) = ($1, $2);
        ## Track how many times we've seen this target so far
        $newseen{$target}++;
        ## If this target occurs more than once in the input file
        if ($seen{$target} > 1) {
            ## Use the %newseen hash to choose the right letter.
            ## The -1 is needed because array indices start at 0.
            ## Only the first match (the gene line) is replaced.
            s/\Q$name\E/$target.$suffix[$newseen{$target}-1]/;
        }
        else {
            s/\Q$name\E/$target/;
        }
    }
    print;
}
Save the script above as foo.pl, make it executable (chmod a+x foo.pl), and run it on your input file:
./foo.pl input.txt > output.txt
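For readers more comfortable with Python, the same two-pass idea can be sketched as follows. This is an illustrative port, not part of the original answer: the function name `rename_records` is hypothetical, and it assumes records are separated by exactly one blank line and that at most 26 records share a target.

```python
import re
from collections import Counter
from string import ascii_lowercase

# Same pattern as the Perl script: gene name, then the first Target= value.
GENE = re.compile(r"Name=(.+?);.*?isoforms=.*?Target=(\S+)", re.S)


def rename_records(text):
    records = text.split("\n\n")
    matches = [GENE.search(record) for record in records]
    # Pass 1: count how many records use each target.
    totals = Counter(m.group(2) for m in matches if m)
    used = Counter()
    out = []
    # Pass 2: rename the gene line of each matching record.
    for record, m in zip(records, matches):
        if m:
            name, target = m.group(1), m.group(2)
            new_name = target
            if totals[target] > 1:
                new_name += "." + ascii_lowercase[used[target]]
                used[target] += 1
            # Replace only the first occurrence, i.e. the gene line.
            record = record.replace(name, new_name, 1)
        out.append(record)
    return "\n\n".join(out)


sample = (
    "ID=gene09514;Name=code_230519_13;isoforms=1\n"
    "ID=mRNA10716;Parent=gene09514;Name=code_230519_13\n"
    "ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0 1 219 +\n"
    "\n"
    "ID=gene00865;Name=code_010512_23;isoforms=1\n"
    "ID=mRNA00979;Parent=gene00865;Name=code_010512_23\n"
    "ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0 1 143 +\n"
    "\n"
    "ID=gene14561;Name=code_299097_0;isoforms=2\n"
    "ID=mRNA16419;Parent=gene14561;Name=code_299097_0\n"
    "ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0 144 193 +\n"
)
result = rename_records(sample)
print(result, end="")
```

Like the Perl version, this takes the replacement from the first Target= field of each record rather than from a separate lookup table, so it only works when every gene block carries its own target.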