use*_*827 2 python pivot-table dataframe pandas
cat1 cat2 col_a col_b
0 (34.0, 38.0] (15.9, 47.0] 29 10
1 (34.0, 38.0] (15.9, 47.0] 37 22
2 (28.0, 34.0] (47.0, 56.0] 3 13
3 (34.0, 38.0] (47.0, 56.0] 15 7
4 (28.0, 34.0] (56.0, 67.0] 42 20
5 (28.0, 34.0] (47.0, 56.0] 31 23
6 (28.0, 34.0] (56.0, 67.0] 26 17
7 (28.0, 34.0] (56.0, 67.0] 7 1
8 (28.0, 34.0] (56.0, 67.0] 36 19
9 (19.0, 28.0] (56.0, 67.0] 5 7
10 (19.0, 28.0] (56.0, 67.0] 21 5
11 (28.0, 34.0] (67.0, 84.0] 37 13
On the data frame above, I want to perform this pivot table operation using pandas:
pd.pivot_table(df, index='cat1', columns='cat2', values='col_a')
But I get this error:
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
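Both col_a and col_b are of type Int32, and cat1 and cat2 are categorical. How do I get rid of this error?
This is a bug related to pivoting on intervals (see GH25814) that will be fixed for v0.25. See also this related question on using crosstab: Pandas crosstab on CategoricalDType columns throws TypeError
In the meantime, here are a couple of options. To aggregate, you will have to use pivot_table and convert the categorical columns to strings before pivoting.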
df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
# to aggregate by taking the mean of col_a
df2.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean')
The caveat is that you lose the benefits of having the index and columns as intervals.
Another option is to pivot on the categorical codes and then reassign the categories:
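To make that caveat concrete, here is a minimal sketch of one benefit that is lost: ordering. The values and bin edges below are made up for illustration and are not taken from the question's data. Once intervals become strings they sort lexicographically, so a label such as (100, 110] lands before (20, 30].

```python
import pandas as pd

# Hypothetical values and bins, chosen only to show the sorting difference.
values = pd.Series([105, 25, 5])
binned = pd.cut(values, bins=[0, 10, 20, 30, 100, 110])

# Interval order: the ordered categorical sorts by interval position.
interval_order = list(binned.sort_values().astype(str))

# String order: plain lexicographic comparison of the labels.
string_order = sorted(binned.astype(str))

print(interval_order)
print(string_order)
```

A pivot built from the string labels inherits the lexicographic order for its index and columns, so it may need reordering afterwards when bins span different numbers of digits.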
df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')
pivot.index = df['cat1'].cat.categories
pivot.columns = df['cat2'].cat.categories
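The assignment works because pivot_table sorts the codes beforehand, and the codes follow the order of the intervals. Note that it assumes every category actually appears in the data; otherwise the number of categories will not match the length of the pivoted index or columns.
Minimal Code Sample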
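When some bins are empty (as happens with pd.cut over fixed bin edges), assigning the full list of categories would not match the pivoted shape. A possible variant, sketched here under that assumption, uses the codes left in the pivot to look up only the categories that occur. The raw numbers below are illustrative midpoints, not the question's data.

```python
import numpy as np
import pandas as pd

# Illustrative raw values; several of the ten bins stay empty on purpose.
df = pd.DataFrame({
    'cat1': [45, 45, 65, 65, 65, 5, 85, 25, 35, 85],
    'cat2': [65, 85, 85, 15, 55, 65, 35, 85, 45, 85],
    'col_a': [18, 38, 26, 14, 9, 10, 21, 17, 6, 16],
})
bins = np.arange(0, 101, 10)
df['cat1'] = pd.cut(df['cat1'], bins=bins)
df['cat2'] = pd.cut(df['cat2'], bins=bins)

# Pivot on the integer codes instead of the interval categories.
df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

# Map each surviving code back to its interval; unused categories are skipped.
cats1 = df['cat1'].cat.categories
cats2 = df['cat2'].cat.categories
pivot.index = cats1[pivot.index.to_numpy()].rename('cat1')
pivot.columns = cats2[pivot.columns.to_numpy()].rename('cat2')

print(pivot)
```

This sketch assumes there are no missing values in cat1 or cat2; a missing value has code -1, which would index the last category by mistake, so such rows should be dropped or handled before pivoting.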
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
    'cat1': np.random.choice(100, 10),
    'cat2': np.random.choice(100, 10),
    'col_a': np.random.randint(1, 50, 10)})
df['cat1'] = pd.cut(df['cat1'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['cat2'], bins=np.arange(0, 101, 10))
df
cat1 cat2 col_a
0 (40, 50] (60, 70] 18
1 (40, 50] (80, 90] 38
2 (60, 70] (80, 90] 26
3 (60, 70] (10, 20] 14
4 (60, 70] (50, 60] 9
5 (0, 10] (60, 70] 10
6 (80, 90] (30, 40] 21
7 (20, 30] (80, 90] 17
8 (30, 40] (40, 50] 6
9 (80, 90] (80, 90] 16
(df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
   .pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean'))
cat2 (10, 20] (30, 40] (40, 50] (50, 60] (60, 70] (80, 90]
cat1
(0, 10] NaN NaN NaN NaN 10.0 NaN
(20, 30] NaN NaN NaN NaN NaN 17.0
(30, 40] NaN NaN 6.0 NaN NaN NaN
(40, 50] NaN NaN NaN NaN 18.0 38.0
(60, 70] 14.0 NaN NaN 9.0 NaN 26.0
(80, 90] NaN 21.0 NaN NaN NaN 16.0