kso*_*all 8 python dataframe pandas
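My DataFrame holds a long list of strings built from the 4 letters 'A', 'T', 'G', 'C', and I need to count the frequency of each letter at every position (index) within the strings.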
import pandas as pd
df = pd.DataFrame({'cases': ['ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTGGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAACGTGGTCTAGA','GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA']})
                                               cases
0  ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGG...
1  ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGG...
2  GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGG...
3  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
4  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
5  ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGG...
6  ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGG...
7  GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGG...
8  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
9  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
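The result should be a new DataFrame of shape 4x113 (the 4 letters against the 113 positions). I can't think of a pandas way to do this. Below is my non-pandas solution: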
def freq_lists(dna_list):
    n = len(dna_list[0])
    A = [0]*n
    T = [0]*n
    G = [0]*n
    C = [0]*n
    for dna in dna_list:
        for index, base in enumerate(dna):
            if base == 'A':
                A[index] += 1
            elif base == 'C':
                C[index] += 1
            elif base == 'G':
                G[index] += 1
            elif base == 'T':
                T[index] += 1
    return {'A': A, 'C': C, 'G': G, 'T': T}

fdf = pd.DataFrame(freq_lists(df['cases'].to_list()))
     A  C  G  T
0    3  0  1  0
1    0  4  0  0
2    0  4  0  0
3    0  0  0  4
4    0  0  0  4
..  .. .. .. ..
108  0  4  0  0
109  0  0  0  4
110  3  0  1  0
111  0  0  4  0
112  4  0  0  0
To clarify, the first row is obtained by tallying the first character of each string in the cases column: AAGA -> A: 3, C: 0, G: 1, T: 0
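As a small illustration of that clarification, here is a sketch on made-up toy strings (not the sequences from the question) whose first characters also spell AAGA. The names `toy`, `first`, and `counts` exist only for this example.

```python
import pandas as pd

# Hypothetical toy data: four short strings instead of the real 113-character ones.
toy = pd.DataFrame({'cases': ['ACGT', 'ACGA', 'GCGA', 'ACTA']})

# Take the character at position 0 of every string.
first = toy['cases'].str[0]

# Tally the letters and force a fixed A, C, G, T order, filling absent letters with 0.
counts = first.value_counts().reindex(list('ACGT'), fill_value=0)

print(''.join(first))     # AAGA
print(counts.to_dict())   # {'A': 3, 'C': 0, 'G': 1, 'T': 0}
```

Repeating this tally for every position gives one row of the desired result per position.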
Let's do it with explode and crosstab:
s = df.cases.map(list).explode()
out = pd.crosstab(s.groupby(level=0).cumcount(),s)
Out[583]:
cases  A  C  G  T
row_0
0      3  0  1  0
1      0  4  0  0
2      0  4  0  0
3      0  0  0  4
4      0  0  0  4
..    .. .. .. ..
108    0  4  0  0
109    0  0  0  4
110    3  0  1  0
111    0  0  4  0
112    4  0  0  0
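To see what each step of the explode/crosstab approach contributes, here is a sketch on hypothetical toy strings (not the question's data). It differs from the snippet above in one respect, which is my own choice: the two inputs are passed to `pd.crosstab` as NumPy arrays with explicit `rownames`/`colnames`, so the result does not depend on how the repeated index labels are aligned.

```python
import pandas as pd

# Hypothetical toy data: three strings of length 4.
toy = pd.DataFrame({'cases': ['ACGT', 'ACGA', 'TCGA']})

# One row per character; each character keeps the label of the string it came from,
# so the index reads 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2.
s = toy['cases'].map(list).explode()

# Number the characters within each original string: 0, 1, 2, 3 for every string.
pos = s.groupby(level=0).cumcount()

# Cross-tabulate position against letter.
out = pd.crosstab(pos.to_numpy(), s.to_numpy(),
                  rownames=['position'], colnames=['base'])
print(out)
```

Each row of `out` is a position and each column a letter, which is the same layout as the question's `fdf`.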
Use collections.Counter, which counts the letters within each string:
from collections import Counter
df['cases'].apply(lambda x: pd.Series(Counter(x)))
Output:
    A   C   T   G
0  27  24  34  28
1  29  26  33  25
2  30  25  33  25
3  29  25  33  26
Going the other way (counting per position instead of per string) is less pretty:
pd.DataFrame([Counter(i)
              for i in list(zip(*df['cases'].apply(list).values))]
             ).fillna(0).astype(int)
Or:
# pd.Series.value_counts is used here because the top-level pd.value_counts
# is deprecated in recent pandas releases.
(df['cases'].apply(lambda x: pd.Series(list(x)))
            .apply(pd.Series.value_counts)
            .T.fillna(0).astype(int)
)
Output:
     A  G  C  T
0    3  1  0  0
1    0  0  4  0
2    0  0  4  0
...
111  0  4  0  0
112  4  0  0  0
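The `zip(*...)` step in the Counter variant above is what turns "one string per row" into "one tuple per position". A minimal sketch on hypothetical toy strings (not the question's data) follows; the `reindex(columns=...)` call is my addition to pin the column order to A, C, G, T, since the order otherwise follows first appearance.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data: three strings of length 4.
toy = ['ACGT', 'ACGA', 'TCGA']

# zip(*toy) transposes the strings: the first tuple holds every string's
# first character, the second tuple every second character, and so on.
columns = list(zip(*toy))
print(columns[0])  # ('A', 'A', 'T')

# One Counter per position; letters missing at a position become NaN,
# hence the fillna(0) before casting back to int.
freq = (pd.DataFrame([Counter(col) for col in columns])
          .reindex(columns=list('ACGT'))
          .fillna(0)
          .astype(int))
print(freq)
```

Note that `zip` stops at the shortest string, so this assumes all sequences have the same length, as they do in the question.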