C. *_*ale 11 python pivot numpy dataframe pandas
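I wrote a web scraper that extracts information from a product table and builds a DataFrame. The table has a Description column containing a comma-separated string of attributes describing each product. I want to create one column in the DataFrame for each unique attribute and fill the rows of that column with the matching attribute substring. An example df is shown below.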
PRODUCTS DATE DESCRIPTION
Product A 2016-9-12 Steel, Red, High Hardness
Product B 2016-9-11 Blue, Lightweight, Steel
Product C 2016-9-12 Red
I think the first step is to split the description into a list.
In: df2 = df['DESCRIPTION'].str.split(',')
Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']
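One detail worth noting about the split step: splitting on a bare comma leaves a leading space on every item after the first. A minimal sketch, assuming the sample data from the question, that splits on a comma followed by any whitespace instead (the name `attrs` is illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": [
        "Steel, Red, High Hardness",
        "Blue, Lightweight, Steel",
        "Red",
    ],
})

# Split on a comma plus any whitespace after it, so the items
# come out without leading spaces.
attrs = df["DESCRIPTION"].str.split(r",\s*")
print(attrs.tolist())
```

With a plain `','` separator the second item of the first row would be `' Red'` rather than `'Red'`, which would later produce separate columns for the two spellings.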
My desired output looks like the table below. The column names are not particularly important.
PRODUCTS   DATE       STEEL_COL  RED_COL  HIGH HARDNESS_COL  BLUE_COL  LIGHTWEIGHT_COL
Product A  2016-9-12  Steel      Red      High Hardness
Product B  2016-9-11  Steel                                  Blue      Lightweight
Product C  2016-9-12             Red
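I believe the columns could be set up with a pivot, but I'm not sure of the most Pythonic way to populate them once they exist. Any help is appreciated.
Thank you very much for the answers. I accepted @MaxU's answer because it seems slightly more flexible, but @piRSquared's gives a very similar result and might even be considered the more Pythonic approach. I tested both versions and both do what I need. Thanks!
You can build a sparse matrix: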
In [27]: df
Out[27]:
PRODUCTS DATE DESCRIPTION
0 Product A 2016-9-12 Steel, Red, High Hardness
1 Product B 2016-9-11 Blue, Lightweight, Steel
2 Product C 2016-9-12 Red
In [28]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(r',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
....: )
Out[28]:
0 Blue High Hardness Lightweight Red Steel
PRODUCTS DATE
Product A 2016-9-12 0 1 0 1 1
Product B 2016-9-11 1 0 1 0 1
Product C 2016-9-12 0 0 0 1 0
In [29]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(r',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
....: )
Out[29]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12                    1                 1      1
Product B 2016-9-11     1                           1           1
Product C 2016-9-12                                      1
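The outputs above mark each attribute with a count of 1. If the cells should hold the attribute names themselves, as in the desired table in the question, one possible sketch follows. It is not taken from either answer: it assumes pandas 0.25 or later (for `explode`), and the helper column names `ATTR` and `VALUE` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": [
        "Steel, Red, High Hardness",
        "Blue, Lightweight, Steel",
        "Red",
    ],
})

# Long format: one row per (product, date, attribute).
long = (
    df.assign(ATTR=df["DESCRIPTION"].str.split(r",\s*"))
      .explode("ATTR")
      .reset_index(drop=True)
)

# Copy the attribute so it can serve as both the column label and the cell value.
long["VALUE"] = long["ATTR"]

# Wide format: one column per attribute, empty string where it is absent.
wide = (
    long.set_index(["PRODUCTS", "DATE", "ATTR"])["VALUE"]
        .unstack("ATTR")
        .fillna("")
)
print(wide)
```

This sketch assumes each attribute appears at most once per product and date; `unstack` raises an error on duplicate index entries, so repeated attributes would need to be deduplicated first.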
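import pandas as pd

cols = ['PRODUCTS', 'DATE']
# One indicator column per attribute, then collapse back to one row per product/date.
pd.get_dummies(
    df.set_index(cols).DESCRIPTION
      .str.split(r',\s*', expand=True).stack()
).groupby(level=cols).sum().astype(int)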
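A more compact variant of the same indicator idea, offered as a sketch rather than as part of the answer above, uses `Series.str.get_dummies`. It assumes no attribute contains the `|` character, because the separators are normalized to `|` first; the names `flags` and `result` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": [
        "Steel, Red, High Hardness",
        "Blue, Lightweight, Steel",
        "Red",
    ],
})

# Turn "comma + optional whitespace" into "|", the default
# separator understood by str.get_dummies.
flags = (
    df["DESCRIPTION"]
    .str.replace(r",\s*", "|", regex=True)
    .str.get_dummies()
)

# Attach the indicator columns to the identifying columns.
result = df[["PRODUCTS", "DATE"]].join(flags)
print(result)
```

Unlike the `groupby` version, this keeps `PRODUCTS` and `DATE` as ordinary columns and does not aggregate repeated product/date pairs.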