I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
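For each section, I am trying to find the 5 letters to the left and the 5 letters to the right of every F, and then count the number of E's and D's in each of those windows.
A representative output looks like this (window, number of E's, number of D's):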
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0
LFRNLFCSYGL 0 0
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
GQKITFHGEGD 1 1
KDHAVFTRRGE 1 1
RGEDLFMCMDI 1 2
EALCGFQKPIS 1 0
RLIIEFKVNFP 1 0
EFKVNFPENGF 2 0
FPENGFLSPDK 1 0
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
ATTPQFPDMIL 0 1
RGHSHFVSDVV 0 1
SSDGQFALSGS 0 1
TTTRRFVGHTK 0 0
VLSVAFSSDNR 0 1
VSCVRFSPNSS 0 0
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
LLLLAFLLLPR 0 0
KRCGGFLIRDD 0 2
LIRDDFVLTAA 0 2
EPTQQFIPVKR 1 0
YNPKNFSNDIM 0 1
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
GAEEKFKEIAE 4 0
RKREIFDRYGE 2 1
ANGTSFSYTFH 0 0
SFSYTFHGDPH 0 1
DPHAMFAEFFG 0 1
AMFAEFFGGRN 1 0
MFAEFFGGRNP 1 0
GGRNPFDTFFG 0 1
NPFDTFFGQRN 0 1
PFDTFFGQRNG 0 1
DIDDPFSGFPM 0 3
DPFSGFPMGMG 0 1
MGMGGFTNVNF 0 0
FTNVNFGRSRS 0 0
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
DGYVKFWQIYI 0 1
LSCLLFCDNHK 0 1
DPDVPFWRFLI 0 2
VPFWRFLITGA 0 0
LQTIRFSPDIF 0 1
FSPDIFSSVSV 0 1
EGHACFSSISE 0 0
SSISEFLLTHP 1 0
HPVLSFGIQVV 0 0
VLIKLFCVHTK 0 0
DVQIRFQPQLN 0 1
TAHEDFTFGES 2 1
HEDFTFGESRP 2 1
PAPADFLSLSS 0 1
MTPDAFMTPSA 0 1
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
ISLGIFPLPAG 0 0
SEQWKFQELSQ 2 0
EENEGFVKVTD 3 1
IENKAFDRNTE 2 1
NTESLFEELSS 3 0
AQRKRFTRVEM 1 0
SSIWQFFSRLF 0 0
SIWQFFSRLFS 0 0
FFSRLFSSSSN 0 0
At first I thought of simply grabbing the 5 letters to the left and right of each F, but I can't figure out how to do that.
In awk:
$ awk '
NR%2 { print; next }                    # print every odd record (the headers)
{                                       # the even records (sequences) are processed
    while(match($0,/.{5}F.{0,5}/)) {    # get 5 before and up to 5 after F
    # 5 before F    ^^^^ ^^^^^^ 0-5 chars after F
    # change to /.{0,5}F.{0,5}/ if needed
        print s=substr($0,RSTART,RLENGTH),  # print match
              gsub(/E/,"E",s),              # count of Es
              gsub(/D/,"D",s)               # count of Ds
        $0=substr($0,RSTART+1)              # shorten the search string
    }
}' file
Some output:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0 # notice another F in the 5+F+5 window
LFRNLFCSYGL 0 0 # .. getting handled
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
...
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1 # only 4 chars after F, matched by .{0,5}
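One caveat with the regex loop: it skips any F that has fewer than 5 characters in front of it, and as far as I can tell switching to /.{0,5}F.{0,5}/ can report the same F more than once, because the search restarts only one character further on. If those early F's matter, a position-based variant might be safer. The following is only a sketch under that assumption; the function name `f_windows` and the variable names are made up for illustration, and it detects headers by a leading `>` rather than by odd line numbers.

```shell
# Hypothetical helper: print each header, then one line per F with
# its window (up to 5 chars either side), the E count and the D count.
f_windows() {
    awk '
    /^>/ { print; next }                      # header lines pass through unchanged
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            if (substr($0, i, 1) != "F")      # only positions holding an F
                continue
            start = (i > 5) ? i - 5 : 1       # clip the window at the left edge
            # substr() itself clips at the right edge of the sequence
            win = substr($0, start, i + 5 - start + 1)
            e = gsub(/E/, "E", win)           # gsub returns the number of Es
            d = gsub(/D/, "D", win)           # ... and the number of Ds
            print win, e, d
        }
    }' "$@"
}
```

Call it as `f_windows file`. Every F is visited exactly once, so overlapping windows still come out as separate lines, and an F near either end of the sequence just gets a shorter window.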