Group data by ranges of values

Pre*_*cks 17 python-2.7 pandas

I have a csv file that shows parts on order. The columns include days late, quantity and commodity.

I need to group the data by days late and commodity, with a sum of the quantity. However, the days late need to be grouped into ranges.

>56
>35 and <= 56
>14 and <= 35
>0 and <=14

I was hoping I could use a dict somehow. Something like this:

{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}

I am looking for a result like this:

        Red  Amber  Yellow  White
STRSUB  56   60     74      40
BOTDWG  20   67     87      34

I am new to pandas, so I don't know whether this is possible at all. Could anyone provide some advice?

Thanks

unu*_*tbu 26

Suppose you start with this data:

df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
#    Days Late      ID  quantity
# 0         60  STRSUB        56
# 1         60  BOTDWG        20
# 2         50  STRSUB        60
# 3         50  BOTDWG        67
# 4         20  STRSUB        74
# 5         20  BOTDWG        87
# 6         10  STRSUB        40
# 7         10  BOTDWG        34

Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories that are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:

df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
#        ID  quantity  status
# 0  STRSUB        56     Red
# 1  BOTDWG        20     Red
# 2  STRSUB        60   Amber
# 3  BOTDWG        67   Amber
# 4  STRSUB        74  Yellow
# 5  BOTDWG        87  Yellow
# 6  STRSUB        40   White
# 7  BOTDWG        34   White
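Since the question asked about driving the ranges from a dict, here is a minimal sketch of one way that could look. The `upper_edges` dict and the choice of 0 and infinity as the outer edges are assumptions for illustration, not part of the answer above; `pd.cut` also accepts the label names directly, which avoids the separate NumPy lookup.

```python
import pandas as pd

# Hypothetical mapping: label -> inclusive upper edge of its range.
# float('inf') stands in for "no upper limit" on the Red range.
upper_edges = {'White': 14, 'Yellow': 35, 'Amber': 56, 'Red': float('inf')}

# Sort the labels by their upper edge so the bin edges increase monotonically.
names = sorted(upper_edges, key=upper_edges.get)
bins = [0] + [upper_edges[name] for name in names]

days_late = pd.Series([60, 50, 20, 10, 14, 0])
# right=True (the default) gives the intervals (0, 14], (14, 35], (35, 56], (56, inf].
status = pd.cut(days_late, bins=bins, labels=names)
print(status.tolist())
```

With 0 as the lowest edge, a value of exactly 0 falls outside every interval and comes back as NaN, which matches the ">0" condition in the question. The answer's choice of -1 as the lowest edge would instead count 0 as White.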

Now use pivot to get the DataFrame in the desired form:

df = df.pivot(index='ID', columns='status', values='quantity')

and use reindex to obtain the desired order for the rows and columns:

df = df.reindex(columns=labels[::-1], index=df.index[::-1])

Thus,

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)

yields

        Red  Amber  Yellow  White
ID                               
STRSUB   56     60      74     40
BOTDWG   20     67      87     34
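One caveat: pivot raises a ValueError when the same (ID, status) pair appears on more than one row, which seems likely in a real order file. Since the question asks for a sum of quantity, a sketch using pivot_table with aggfunc='sum' may be closer to what is needed. The data below is made up for illustration, with STRSUB deliberately appearing twice under Red.

```python
import pandas as pd

# Illustrative data: STRSUB has two separate 'Red' rows, which pivot would reject.
df = pd.DataFrame({'ID': ['STRSUB', 'STRSUB', 'BOTDWG', 'STRSUB', 'BOTDWG'],
                   'status': ['Red', 'Red', 'Red', 'White', 'White'],
                   'quantity': [56, 4, 20, 40, 34]})

# aggfunc='sum' adds up the quantities that share an (ID, status) pair.
table = df.pivot_table(index='ID', columns='status', values='quantity',
                       aggfunc='sum')
# Put the rows and columns in the order wanted for display.
table = table.reindex(index=['STRSUB', 'BOTDWG'], columns=['Red', 'White'])
print(table)
```

Here the two Red rows for STRSUB collapse into a single cell holding 60.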


mta*_*add 6

You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.

import numpy
import pandas

df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
                        'Days Late': numpy.random.randn(8)*20+30})

   Days Late   ID
0  30.746244  foo
1  16.234267  bar
2  14.771567  foo
3  33.211626  bar
4   3.497118  foo
5  52.482879  bar
6  11.695231  foo
7  47.350269  foo

Create a helper function to transform the data in the Days Late column, and add a column called Code.

def days_late_xform(dl):
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'

df["Code"] = df['Days Late'].map(days_late_xform)

   Days Late   ID    Code
0  30.746244  foo  Yellow
1  16.234267  bar  Yellow
2  14.771567  foo  Yellow
3  33.211626  bar  Yellow
4   3.497118  foo   White
5  52.482879  bar   Amber
6  11.695231  foo   White
7  47.350269  foo   Amber

Finally, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:

g = df.groupby(["ID","Code"]).size()
print(g)

ID   Code
bar  Amber     1
     Yellow    2
foo  Amber     1
     White     2     
     Yellow    2

df2 = g.unstack()
print(df2)

Code  Amber  White  Yellow
ID
bar       1    NaN       2
foo       1      2       2
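The NaN for bar/White appears because that combination never occurs in the data. If a zero count is preferable, unstack takes a fill_value argument. A small sketch, using a hand-written frame that mirrors the Code column shown above rather than the random sample:

```python
import pandas as pd

# Hand-written copy of the ID/Code pairs from the output above (illustrative).
df = pd.DataFrame({'ID': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'Code': ['Yellow', 'Yellow', 'Yellow', 'Yellow',
                            'White', 'Amber', 'White', 'Amber']})

# fill_value=0 replaces the missing (ID, Code) combinations with 0 instead of NaN.
counts = df.groupby(['ID', 'Code']).size().unstack(fill_value=0)
print(counts)
```

Filling with 0 also keeps the counts as integers; with NaN present, pandas would typically store the column as floats.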