How to extract the 5 letters on either side of a specific letter, for each section of the data

Lea*_*ner 2 awk sed

I have data like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK

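For each section, I am trying to find the 5 letters to the left and the 5 letters to the right of every F, and then count the number of E and D in each of those windows.

A representative output looks like this: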

 >sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
    RQCSWFAGCTN   0  0
    LLYQLFRNLFC   0  0
    LFRNLFCSYGL   0  0
    NNSGLFFLCGN   0  0
    NSGLFFLCGNG   0  0
    GVYKGFPPKWS   0  0
    TNLRSFIHKVT   0  0
    >sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
    GQKITFHGEGD   1  1
    KDHAVFTRRGE   1  1
    RGEDLFMCMDI   1  2
    EALCGFQKPIS   1  0
    RLIIEFKVNFP   1  0
    EFKVNFPENGF   2  0
    FPENGFLSPDK   1  0
    >sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
    ATTPQFPDMIL   0  1
    RGHSHFVSDVV   0  1
    SSDGQFALSGS   0  1
    TTTRRFVGHTK   0  0
    VLSVAFSSDNR   0  1
    VSCVRFSPNSS   0  0
    >sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
    VLIKLFCVHTK   0  0 
    DVQIRFQPQL    0  1
    >sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
    LLLLAFLLLPR   0  0
    KRCGGFLIRDD   0  2
    LIRDDFVLTAA   0  2
    EPTQQFIPVKR   1  0
    YNPKNFSNDIM   0  1
    >sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
    GAEEKFKEIAE   4  0
    RKREIFDRYGE   2  1
    ANGTSFSYTFH   0  0
    SFSYTFHGDPH   0  1
    DPHAMFAEFFG   0  1
    AMFAEFFGGRN   1  0
    MFAEFFGGRNP   1  0
    GGRNPFDTFFG   0  1
    NPFDTFFGQRN   0  1
    PFDTFFGQRNG   0  1
    DIDDPFSGFPM   0  3
    DPFSGFPMGMG   0  1 
    MGMGGFTNVNF   0  0 
    FTNVNFGRSRS   0  0
    >sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
    DGYVKFWQIYI   0  1   
    LSCLLFCDNHK   0  1
    DPDVPFWRFLI   0  2
    VPFWRFLITGA   0  0
    LQTIRFSPDIF   0  1
    FSPDIFSSVSV   0  1
    EGHACFSSISE   0  0
    SSISEFLLTHP   1  0
    HPVLSFGIQVV   0  0
    VLIKLFCVHTK   0  0
    DVQIRFQPQLN   0  1
    TAHEDFTFGES   2  1
    HEDFTFGESRP   2  1
    PAPADFLSLSS   0  1
    MTPDAFMTPSA   0  1
    >sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
    ISLGIFPLPAG   0  0
    SEQWKFQELSQ   2  0
    EENEGFVKVTD   3  1
    IENKAFDRNTE   2  1
    NTESLFEELSS   3  0
    AQRKRFTRVEM   1  0
    SSIWQFFSRLF   0  0
    SIWQFFSRLFS   0  0
    FFSRLFSSSSN   0  0

My first idea was to find the 5 letters to the left and to the right of each F, but I could not figure out how to do that.

Jam*_*own 5

In awk:

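$ awk '
NR%2 { print; next }                               # odd records are headers: print as-is
{                                                  # even records are sequences
    while (match($0, /.{5}F.{0,5}/)) {             # 5 chars before F, up to 5 after
        # An F within the first 5 chars of a sequence is skipped by this pattern.
        # Merely switching to /.{0,5}F.{0,5}/ would report such an F more than once,
        # because the search string is shortened by only one char per iteration.
        # Interval expressions such as {5} need a POSIX awk (old gawk: --re-interval).
        s = substr($0, RSTART, RLENGTH)            # the matched window
        print s,
              gsub(/E/, "E", s),                   # count of Es
              gsub(/D/, "D", s)                    # count of Ds
        $0 = substr($0, RSTART+1)                  # shorten the search string
    }
}' file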

Some output:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0    # note the second F inside this 5+F+5 window
LFRNLFCSYGL 0 0    # ... it gets its own row here
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
...
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1     # only 4 chars after F, allowed by .{0,5}
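The regex approach above needs interval expressions and a full five characters to the left of each F. As a rough alternative sketch, not part of the original answer, the same windows can be built with plain `substr` arithmetic, which also runs on awks without `{n,m}` support. Two assumptions to note: the helper name `f_windows` is made up for illustration, and the left edge is clamped at the start of the line, so an F within the first five characters is reported with a shorter left context instead of being skipped. As in the question's sample, each sequence is assumed to sit on a single line.

```shell
# Hypothetical helper (the name is an assumption): reads FASTA-like text on
# stdin with one sequence per line and prints every window around an F.
f_windows() {
    awk '
    /^>/ { print; next }                       # header lines pass through unchanged
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            if (substr($0, i, 1) != "F") continue
            start = (i > 5) ? i - 5 : 1        # clamp the left edge at the line start
            s = substr($0, start, i + 5 - start + 1)  # substr truncates at the line end
            e = gsub(/E/, "E", s)              # gsub returns the number of replacements
            d = gsub(/D/, "D", s)
            print s, e, d
        }
    }'
}

# Example with a made-up record:
printf '>demo\nAFCDEEGHIKFLD\n' | f_windows
```

Headers are detected with `/^>/` rather than `NR%2`, so a stray blank line does not shift the odd/even bookkeeping. If truncated windows near the start are not wanted, adding `i > 5` to the condition that selects an F gives the original behaviour.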