Tim*_*Tim 29 python numpy r pandas
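I am generating many data frames with the same shape, and I want to compare them to each other. I want to be able to get the mean and median across the data frames.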
Source.0 Source.1 Source.2 Source.3
cluster
0 0.001182 0.184535 0.814230 0.000054
1 0.000001 0.160490 0.839508 0.000001
2 0.000001 0.173829 0.826114 0.000055
3 0.000432 0.180065 0.819502 0.000001
4 0.000152 0.157041 0.842694 0.000113
5 0.000183 0.174142 0.825674 0.000001
6 0.000001 0.151556 0.848405 0.000038
7 0.000771 0.177583 0.821645 0.000001
8 0.000001 0.202059 0.797939 0.000001
9 0.000025 0.189537 0.810410 0.000028
10 0.006142 0.003041 0.493912 0.496905
11 0.003739 0.002367 0.514216 0.479678
12 0.002334 0.001517 0.529041 0.467108
13 0.003458 0.000001 0.532265 0.464276
14 0.000405 0.005655 0.527576 0.466364
15 0.002557 0.003233 0.507954 0.486256
16 0.004161 0.000001 0.491271 0.504568
17 0.001364 0.001330 0.528311 0.468996
18 0.002886 0.000001 0.506392 0.490721
19 0.001823 0.002498 0.509620 0.486059
Source.0 Source.1 Source.2 Source.3
cluster
0 0.000001 0.197108 0.802495 0.000396
1 0.000001 0.157860 0.842076 0.000063
2 0.094956 0.203057 0.701662 0.000325
3 0.000001 0.181948 0.817841 0.000210
4 0.000003 0.169680 0.830316 0.000001
5 0.000362 0.177194 0.822443 0.000001
6 0.000001 0.146807 0.852924 0.000268
7 0.001087 0.178994 0.819564 0.000354
8 0.000001 0.202182 0.797333 0.000485
9 0.000348 0.181399 0.818252 0.000001
10 0.003050 0.000247 0.506777 0.489926
11 0.004420 0.000001 0.513927 0.481652
12 0.006488 0.001396 0.527197 0.464919
13 0.001510 0.000001 0.525987 0.472502
14 0.000001 0.000001 0.520737 0.479261
15 0.000001 0.001765 0.515658 0.482575
16 0.000001 0.000001 0.492550 0.507448
17 0.002855 0.000199 0.526535 0.470411
18 0.000001 0.001952 0.498303 0.499744
19 0.001232 0.000001 0.506612 0.492155
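Then I want to get the mean of these two data frames.
What is the simplest way to do this?
Just to clarify: I want the mean of each individual cell, given that the index and columns of all the data frames are exactly the same.
So in the example I gave, the mean for [0, Source.0] would be
(0.001182 + 0.000001) / 2 = 0.0005915.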
ali*_*i_m 30
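Assuming the two data frames have the same columns, you can concatenate them and compute your summary statistics on the concatenated frame: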
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print(df_concat.mean())
# x -0.163044
# y 2.120000
# dtype: float64
print(df_concat.median())
# x -0.192037
# y 2.000000
# dtype: float64
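If you want to compute statistics across each set of rows that share the same index in the two data sets, you can use .groupby() to group the data by row index, then apply the mean, median, etc.: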
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print(df_means.head())
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
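This method works even if your data frames have unequal numbers of rows: if a particular row index is missing from one of the two data frames, the mean/median is computed over the single existing row.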
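As a minimal sketch of the groupby approach applied to the question's layout, the snippet below uses only the first two rows of each example frame (values copied from the question) and computes the cell-wise mean and median. The variable names are illustrative, not part of the original answer.

```python
import pandas as pd

cols = ["Source.0", "Source.1", "Source.2", "Source.3"]
idx = pd.Index([0, 1], name="cluster")

# First two rows of each frame from the question
df1 = pd.DataFrame(
    [[0.001182, 0.184535, 0.814230, 0.000054],
     [0.000001, 0.160490, 0.839508, 0.000001]],
    index=idx, columns=cols,
)
df2 = pd.DataFrame(
    [[0.000001, 0.197108, 0.802495, 0.000396],
     [0.000001, 0.157860, 0.842076, 0.000063]],
    index=idx, columns=cols,
)

# Stack the frames vertically, then group the rows that share an index label
by_row_index = pd.concat([df1, df2]).groupby(level=0)
cell_means = by_row_index.mean()
cell_medians = by_row_index.median()

print(cell_means.loc[0, "Source.0"])
```

The printed value should be approximately 0.0005915, matching the hand calculation in the question. With only two frames the median of each cell equals its mean; the two statistics only differ once three or more frames are concatenated.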
Foo*_*Bar 15
My approach is similar to @ali_m's, but since you want one mean per row-column combination, I end up with something slightly different:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=0).mean()  # plain concat yields a single-level index, so group on level 0
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
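Grouping on a level other than 0 only makes sense when the concatenated index has more than one level. One way to get that, sketched here under the assumption that you want to keep track of which frame each row came from, is to pass keys= to pd.concat. The key labels "a" and "b" and the toy values are purely illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [10, 20]})
df2 = pd.DataFrame({"x": [3.0, 6.0], "y": [30, 40]})

# keys= adds an outer index level that identifies the source frame
stacked = pd.concat([df1, df2], keys=["a", "b"])

# Level 0 is the frame label, level 1 is the original row label
per_cell = stacked.groupby(level=1).mean()   # one mean per row-column combination
per_frame = stacked.groupby(level=0).mean()  # one mean per column within each frame

print(per_cell)
print(per_frame)
```

Here per_cell is the cell-wise average the question asks for, while per_frame summarises each frame on its own.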
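Here is a solution that first stacks both data frames so that each becomes a Series with a MultiIndex (cluster, column name). You can then use Series addition and division, which automatically align on the index, and finally unstack the result. Here is the code: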
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And you're done.
Or, for a more general purpose:
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
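The general version above can be sketched for three frames as follows, calling .mean(axis=1) and .median(axis=1) directly in place of the apply/lambda step. The toy values are made up for illustration, and the sketch assumes all frames share the same index and columns.

```python
import pandas as pd

dfs = [
    pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}),
    pd.DataFrame({"a": [3.0, 4.0], "b": [5.0, 6.0]}),
    pd.DataFrame({"a": [8.0, 9.0], "b": [10.0, 11.0]}),
]

# One column per frame; each row is a (row label, column name) pair
stacked = pd.concat([df.stack() for df in dfs], axis=1)

# Reduce across the frames, then restore the original 2-D layout
averages = stacked.mean(axis=1).unstack()
medians = stacked.median(axis=1).unstack()

print(averages)
print(medians)
```

Because the reduction runs over a frame with one column per input, any other row-wise statistic (std, min, max) can be swapped in the same way.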
You can simply assign a label to each frame, call it group, then concat and groupby to do what you want:
In [57]: df = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, every group is identical, but only because df == df2.
Alternatively, you can put the frames into a Panel:
In [69]: df = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
Following Niklas's comment, the solution to the question is panel.mean(axis=0).
As a more complete example:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
This gives the following output:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
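Note that pd.Panel was deprecated in pandas 0.20 and removed in later releases, so the Panel code above will not run on a current installation. As a rough sketch of one alternative (a swapped-in technique, not the answerer's method), the same frames can be stacked into a 3-D NumPy array and reduced along the first axis. This assumes every frame has identical index and columns; the names cube, mean_df and median_df are illustrative.

```python
import numpy as np
import pandas as pd

nrows, ncols = 4, 3
dfs = [
    pd.DataFrame(np.arange(i, nrows * ncols + i).reshape(nrows, ncols),
                 columns=list("abc"))
    for i in range(4)
]

# Shape (n_frames, nrows, ncols); axis 0 runs over the frames
cube = np.stack([df.to_numpy() for df in dfs])

# Reduce over the frame axis and re-attach the shared labels
mean_df = pd.DataFrame(cube.mean(axis=0),
                       index=dfs[0].index, columns=dfs[0].columns)
median_df = pd.DataFrame(np.median(cube, axis=0),
                         index=dfs[0].index, columns=dfs[0].columns)

print(mean_df)
```

For these inputs mean_df should reproduce the "Mean of stacked DFs" table shown above. Because np.stack ignores labels, this only gives correct results when the frames are already aligned; the concat/groupby approaches are safer when indices may differ.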