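I have a list of a few hundred amino acid sequences named aa_seq, which looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG','KVHSLPGMSDNGSPAVLPKTEFNKYKI','RAQVEDLMSLSPHVENASIPKGSTPIP','TSTNNYPMVQEQAILSCIEQTMVADAK',...]. Each sequence is 27 letters long. I have to determine the most frequently used amino acid at each position (1-27), along with its frequency.
So far I have: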
count_dict = {}
counter = count_dict.values()
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(0,26):  # first round: looks at the first position in each sequence
    for s in range(0,len(aa_seq)):  # goes through all sequences of the list
        for item in aa_list:  # and checks for the occurrence of each amino acid letter (=item)
            if item in aa_seq[s][p]:
                count_dict[item]  # if that letter occurs at the respective position, make it a key in the dictionary
                counter += 1  # and increase its counter (the value, as defined above) by one
print count_dict
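It says KeyError: 'A', and it points to the line count_dict[item]. So the items of aa_list apparently cannot be added as keys this way..? How do I do that? It also gives the error "'int' object is not iterable" regarding the counter. How can I increment the counter?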
You can use the Counter class:
>>> from collections import Counter
>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)]
>>> s
[('A', 1),
('A', 1),
('Y', 1),
('I', 1),
('N', 1),
('Y', 1),
('P', 2),
('M', 4),
('S', 2),
('Q', 1),
('E', 2),
('Q', 1),
('I', 1),
('I', 1),
('A', 1),
('Q', 1),
('A', 1),
('I', 1),
('I', 1),
('Q', 1),
('E', 2),
('C', 1),
('Q', 1),
('A', 1),
('Q', 1),
('I', 1),
('I', 1)]
However, this may be inefficient if you have a large dataset.
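As a possible alternative for longer lists, here is a minimal sketch (not the answerer's code) that transposes the sequences with zip so that each position is tallied in a single pass. The function names most_common_per_position and count_position are hypothetical, chosen only for illustration. The second function shows one way to build the tally with a plain dictionary: reading count_dict[item] for a key that does not exist yet is what raises the KeyError, whereas dict.get with a default of 0 avoids it.

```python
from collections import Counter

def most_common_per_position(seqs):
    """Return one (amino_acid, count) tuple per position (hypothetical helper)."""
    result = []
    # zip(*seqs) transposes the list: each column holds the letters
    # found at one position across all sequences.
    for column in zip(*seqs):
        result.append(Counter(column).most_common(1)[0])
    return result

def count_position(seqs, p):
    """Tally the letters at 0-based position p with a plain dict (hypothetical helper)."""
    counts = {}
    for seq in seqs:
        aa = seq[p]
        # get() returns 0 for a letter not seen yet, so no KeyError is raised
        counts[aa] = counts.get(aa, 0) + 1
    return counts

aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

per_position = most_common_per_position(aa_seq)
print(per_position[7])             # ('M', 4)
print(count_position(aa_seq, 6))   # {'P': 2, 'G': 1, 'L': 1}
```

To turn a count into a relative frequency, divide it by len(aa_seq). Note that zip stops at the shortest sequence, so this assumes all sequences have the same length, and that when several letters tie at a position, which of them is reported depends on the Python version.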