Case-sensitive replacement; identical target IDs

use*_*677 4 sed python perl

I am having trouble doing case-sensitive replacements in a text file. Below is part of the sed file that I run as sed -f file.sed < input.txt > output.txt

 s/\<code_229633_13\>/R77_08349T0/
 s/\<code_229633_138\>/R77_09738T0/
 s/\<code_230519_10\>/R77_04813T0/
 s/\<code_230519_1\>/R77_13591T0/
 s/\<code_230519_13\>/R77_05463T0/
 up to line 14521....

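The code works fine, but I also have cases where 2 or more target IDs (code_010512_23 and code_299097_0) map to the same replacement ID (R77_14520T0). For those I would like the output to be something like R77_14520T0.a and R77_14520T0.b (lines 1 and 2 below):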

s/code_010512_23/R77_14520T0/ --> R77_14520T0.a
s/code_299097_0/R77_14520T0/ --> R77_14520T0.b
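One way to get the suffixed IDs is to count how often each replacement ID occurs in the mapping and append a letter only to those that occur more than once. The following is a minimal Python sketch, assuming the mapping is available as a list of (old ID, new ID) pairs; the names `add_suffixes` and `pairs` are hypothetical and do not come from the original sed file.

```python
from collections import Counter
from string import ascii_lowercase


def add_suffixes(pairs):
    """Append .a, .b, ... to new IDs that are shared by several old IDs.

    `pairs` is a list of (old_id, new_id) tuples (assumed input shape).
    New IDs that occur only once are left unchanged. This sketch supports
    at most 26 old IDs per new ID.
    """
    totals = Counter(new_id for _, new_id in pairs)
    seen = Counter()
    result = []
    for old_id, new_id in pairs:
        if totals[new_id] > 1:
            suffix = ascii_lowercase[seen[new_id]]
            seen[new_id] += 1
            new_id = f"{new_id}.{suffix}"
        result.append((old_id, new_id))
    return result


pairs = [
    ("code_229633_13", "R77_08349T0"),
    ("code_010512_23", "R77_14520T0"),
    ("code_299097_0", "R77_14520T0"),
]
for old_id, new_id in add_suffixes(pairs):
    # Emit sed commands in the same form as the lines of file.sed
    print(rf"s/\<{old_id}\>/{new_id}/")
```

The printed lines could be redirected into a new sed file, so the existing `sed -f` workflow stays unchanged.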

A more complicated but similar case is when I have the following input file (input2.txt):

  ID=gene09464;Name=code_229633_13;isoforms=1           
  ID=mRNA10661;Parent=gene09464;Name=code_229633_13         
  ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1   1093    +
  ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0  1094    1873    +

  ID=gene09491;Name=code_229633_138;isoforms=1          
  ID=mRNA10690;Parent=gene09491;Name=code_229633_138            
  ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0 1   411 +

  ID=gene09513;Name=code_230519_10;isoforms=1           
  ID=mRNA10715;Parent=gene09513;Name=code_230519_10         
  ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  1   59  +
  ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0  60  186 +

  ID=gene09511;Name=code_230519_1;isoforms=1            
  ID=mRNA10713;Parent=gene09511;Name=code_230519_1          
  ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1   1075    +
  ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0   1076    1128    +

  ID=gene09514;Name=code_230519_13;isoforms=1           
  ID=mRNA10716;Parent=gene09514;Name=code_230519_13         
  ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0  1   219 +

  ID=gene00865;Name=code_010512_23;isoforms=1           
  ID=mRNA00979;Parent=gene00865;Name=code_010512_23         
  ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0  1   143 +

  ID=gene14561;Name=code_299097_0;isoforms=2            
  ID=mRNA16419;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0   144 193 +
  ID=mRNA16420;Parent=gene14561;Name=code_299097_0          
  ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0   408 457 +

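I need to apply the substitution in the same way as before, but only on lines that contain the word "isoforms", in other words on lines 1, 6, 10, 15, 20, 24 and 28, and nowhere else in the text. The input file will always have exactly the format shown, with a blank line before each "isoforms" line.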
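The restriction to "isoforms" lines can be sketched in Python as follows. This is only an illustration under stated assumptions: the mapping is held in a dict from old ID to new ID, and the function name `rename_isoform_lines` is hypothetical.

```python
import re

# The value of the Name= field, up to the next ';' or whitespace.
NAME_FIELD = re.compile(r"(?<=Name=)[^;\s]+")


def rename_isoform_lines(lines, mapping):
    """Replace Name=<old> with Name=<new> only on lines containing 'isoforms'.

    `mapping` is a dict from old ID to new ID (assumed input shape).
    IDs that are missing from the mapping are left untouched.
    """
    out = []
    for line in lines:
        if "isoforms" in line:
            # Look up the whole Name value, so code_230519_1 is never
            # matched inside code_230519_13.
            line = NAME_FIELD.sub(
                lambda m: mapping.get(m.group(0), m.group(0)), line, count=1
            )
        out.append(line)
    return out
```

Looking up the complete field value in a dict plays the role of the `\<...\>` word boundaries in the sed file, and it avoids running thousands of substitutions against every line.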

My desired output:

 ID=gene09464;Name=R77_08349T0;isoforms=1           
 ID=mRNA10661;Parent=gene09464;Name=code_229633_13          
 ID=exon26192;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1   1093    +
 ID=exon26193;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1094    1873    +
 ID=exon26194;Parent=mRNA10661;Name=code_229633_13;Target=R77_08349T0   1874    4065    +

 ID=gene09491;Name=R77_09738T0;isoforms=1           
 ID=mRNA10690;Parent=gene09491;Name=code_229633_138         
 ID=exon26252;Parent=mRNA10690;Name=code_229633_138;Target=R77_09738T0  1   411 +

 ID=gene09513;Name=R77_04813T0;isoforms=1            
 ID=mRNA10715;Parent=gene09513;Name=code_230519_10          
 ID=exon26311;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1   59  +
 ID=exon26312;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   60  186 +
 ID=exon26313;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   187 678 +
 ID=exon26314;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   679 1399    +
 ID=exon26315;Parent=mRNA10715;Name=code_230519_10;Target=R77_04813T0   1400    1402    +

 ID=gene09511;Name=R77_13591T0;isoforms=1           
 ID=mRNA10713;Parent=gene09511;Name=code_230519_1           
 ID=exon26308;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1   1075    +
 ID=exon26309;Parent=mRNA10713;Name=code_230519_1;Target=R77_13591T0    1076    1128    +

 ID=gene09514;Name=R77_05463T0;isoforms=1           
 ID=mRNA10716;Parent=gene09514;Name=code_230519_13          
 ID=exon26316;Parent=mRNA10716;Name=code_230519_13;Target=R77_05463T0   1   219 +

 ID=gene00865;Name=R77_14520T0.a;isoforms=1         
 ID=mRNA00979;Parent=gene00865;Name=code_010512_23          
 ID=exon02477;Parent=mRNA00979;Name=code_010512_23;Target=R77_14520T0   1   143 +

 ID=gene14561;Name=R77_14520T0.b;isoforms=2         
 ID=mRNA16419;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16419;Name=code_299097_0;Target=R77_14520T0    144 193 +
 ID=mRNA16420;Parent=gene14561;Name=code_299097_0           
 ID=exon39828;Parent=mRNA16420;Name=code_299097_0;Target=R77_15554T0    408 457 +

ter*_*don 5

You can't really do this kind of thing with sed; it's just a text stream editor. Try this Perl script:

#!/usr/bin/env perl
use strict;
use warnings;

## Set the record separator to \n\n to
## read multiple lines as a single record
$/ = "\n\n";
## This array will contain all records of the file
my @records = <>;

## The list of suffixes
my @suffix = ('a' .. 'z');

## How many records point at each target
my %seen;
## For each record of the input file (records are multiline
## because we set $/ to consecutive newlines)
foreach (@records) {
    ## If the record is one we are interested in,
    ## count its first target
    if (/isoforms.*?Target=(\S+)/s) {
        $seen{$1}++;
    }
}

## Now that we have processed the entire file,
## go back and print each record.
my %newseen;
foreach (@records) {
    ## If this record is one of the ones we're interested in
    if (/Name=(.+?);.*?isoforms=.*?Target=(\S+)/s) {
        my ($name, $target) = ($1, $2);
        ## Count how many times we've seen this target so far.
        $newseen{$target}++;
        ## If this target exists more than once in the input file
        if ($seen{$target} > 1) {
            ## Use the %newseen hash to choose the right letter.
            ## The -1 is needed because the first element of an
            ## array is 0, not 1.
            s/\Q$name\E/$target.$suffix[$newseen{$target}-1]/;
        }
        else {
            s/\Q$name\E/$target/;
        }
    }
    print;
}

Save the script above as foo.pl, make it executable (chmod a+x foo.pl) and run it on your input file:

./foo.pl input.txt > output.txt
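Since the question is also tagged python, the same two-pass idea can be sketched in Python. This is not a translation checked against the full 14521-line data set, only an illustration that assumes records are separated by completely empty lines; the name `rename_records` is hypothetical.

```python
import re
from collections import Counter
from string import ascii_lowercase

# First Target= value that follows the "isoforms" line of a record.
TARGET = re.compile(r"isoforms.*?Target=(\S+)", re.S)
# The Name= value on the "isoforms" line itself.
NAME = re.compile(r"(?<=Name=)[^;\s]+(?=;[^\n]*isoforms=)")


def rename_records(text):
    """Rename the gene line of each blank-line-separated record in two passes."""
    records = text.split("\n\n")
    # Pass 1: count how many records point at each target.
    totals = Counter(m.group(1) for m in map(TARGET.search, records) if m)
    seen = Counter()
    out = []
    # Pass 2: rewrite the Name on each "isoforms" line.
    for record in records:
        m = TARGET.search(record)
        if m:
            target = m.group(1)
            new_name = target
            if totals[target] > 1:
                # Shared target: append .a, .b, ... in order of appearance.
                new_name += "." + ascii_lowercase[seen[target]]
                seen[target] += 1
            record = NAME.sub(lambda _: new_name, record, count=1)
        out.append(record)
    return "\n\n".join(out)
```

It could be driven with something like `print(rename_records(open("input.txt").read()), end="")`. As in the Perl script, only the first Target of each record decides the new name.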