Binning a DataFrame in pandas in Python

37 python numpy pandas

Given the following DataFrame in pandas:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

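where id is an ID for each point, consisting of an a and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/mean of a and b in each bin)? df may have NaN values for a or b (or both) in any given row. Thanks.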

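Here is a better example, using Joe Kington's solution with a more realistic df. What I'm unsure about is how to access the df.b elements for each df.a group below: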

import numpy as np
import pandas

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...

Joe*_*ton 58

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I would do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
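As a small illustration of how np.digitize assigns group labels, here is a minimal sketch on made-up values (the toy array and the three-edge bins are assumptions for the example, not part of the answer above). Each label i means bins[i-1] <= x < bins[i], so a value equal to the last edge gets a label of its own. This suggests that when the edges run from df.a.min() to df.a.max(), the row holding the maximum ends up alone in an extra group.

```python
import numpy as np

# Hypothetical toy data: three edges define two regular bins.
values = np.array([0.0, 0.3, 0.5, 1.0])
bins = np.linspace(0, 1, 3)  # array([0. , 0.5, 1. ])

# With the default right=False, label i means bins[i-1] <= x < bins[i].
indices = np.digitize(values, bins)
print(indices)  # [1 1 2 3]
```

The value 1.0 sits on the last edge, so it is labeled 3 rather than 2.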

Edit: Since the OP asked specifically for the means of b binned by the values in a, just do

groups.mean().b

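Also, if you want the index to look nicer (e.g. to display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)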

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
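The question also mentions rows where a or b (or both) is NaN. The following is a minimal sketch of what appears to happen in that case, using a made-up five-row frame and hand-picked edges (both are assumptions for illustration): pandas.cut leaves a NaN in a unbinned, groupby then drops that row by default, and mean() skips a NaN in b within its bin. observed=False is passed explicitly so that empty bins would still be listed.

```python
import numpy as np
import pandas

# Hypothetical toy frame with one NaN in each column.
df = pandas.DataFrame({"a": [0.1, 0.2, np.nan, 0.8, 0.9],
                       "b": [10.0, np.nan, 12.0, 13.0, 14.0]})

bins = [0, 0.5, 1.0]
binned = pandas.cut(df["a"], bins)  # a NaN in "a" stays NaN (no bin)

groups = df.groupby(binned, observed=False)
sizes = groups.size()          # the row with NaN in "a" is not counted
b_means = groups["b"].mean()   # a NaN in "b" is skipped within its bin

print(sizes)
print(b_means)
```

If rows with a missing a should be reported rather than silently dropped, comparing sizes.sum() with len(df) is one simple check.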


bdi*_*nte 24

Not 100% sure if this is what you're looking for, but here is what I think you're getting at:

In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id":   np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(cut(df.a,bins))

In [147]: b_bins = df.b.groupby(cut(df.b,bins))

In [148]: a_bins.agg([mean,median])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg([mean,median])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your own.
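The session above relies on names such as DataFrame, cut, mean and median already being in the interactive namespace. As a self-contained sketch of the same idea, the version below spells out the imports and passes the aggregations as strings. It is a variation rather than the exact code above: it bins every row by a and summarizes both a and b per bin, which seems closer to what the question asks. The seeded generator is an assumption, used only to make the run reproducible.

```python
import numpy as np
import pandas

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pandas.DataFrame({"a": rng.random(100),
                       "b": rng.random(100),
                       "id": np.arange(100)})

bins = [0, 0.25, 0.5, 0.75, 1]

# Bin every row by its "a" value, then summarize both columns per bin.
by_a = df.groupby(pandas.cut(df["a"], bins), observed=False)
summary = by_a[["a", "b"]].agg(["mean", "median"])
print(summary)
```

The result has one row per interval and a two-level column index, so summary["b"]["mean"] gives the mean of b within each bin of a.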


小智 14

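Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(). Summing groups.size() gave 99 instead of 100.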

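To guarantee that all of the data is binned, just pass the number of bins to cut(). The function then automatically pads the first [last] bin by 0.1% to ensure that all of the data is included.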

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing groups.size() gave 100.
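The difference can be checked directly by counting how many rows receive a bin. The sketch below is a minimal comparison on seeded random data (the seed and variable names are assumptions for illustration). It also tries include_lowest=True, a pandas.cut parameter not mentioned above, which closes the first interval on the left and so is one way to keep explicit edges without losing the minimum.

```python
import numpy as np
import pandas

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pandas.DataFrame({"a": rng.random(100)})

edges = np.linspace(df["a"].min(), df["a"].max(), 10)

# Explicit edges: intervals are (left, right], so the minimum falls outside.
n_edges = pandas.cut(df["a"], edges).notna().sum()

# include_lowest=True closes the first interval on the left.
n_lowest = pandas.cut(df["a"], edges, include_lowest=True).notna().sum()

# An integer bin count lets pandas extend the range slightly on its own.
n_count = pandas.cut(df["a"], 10).notna().sum()

print(n_edges, n_lowest, n_count)  # 99 100 100
```

Rows that receive no bin are exactly the ones a subsequent groupby would drop, so these counts match the sums of groups.size() described above.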

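I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial for getting the correct answer.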