Using pandas pivot_table with interval columns raises a TypeError

use*_*827 2 python pivot-table dataframe pandas

      cat1             cat2                       col_a             col_b
0    (34.0, 38.0]    (15.9, 47.0]             29               10
1    (34.0, 38.0]    (15.9, 47.0]             37               22
2    (28.0, 34.0]    (47.0, 56.0]              3               13
3    (34.0, 38.0]    (47.0, 56.0]             15                7
4    (28.0, 34.0]    (56.0, 67.0]             42               20
5    (28.0, 34.0]    (47.0, 56.0]             31               23
6    (28.0, 34.0]    (56.0, 67.0]             26               17
7    (28.0, 34.0]    (56.0, 67.0]              7                1
8    (28.0, 34.0]    (56.0, 67.0]             36               19
9    (19.0, 28.0]    (56.0, 67.0]              5                7
10   (19.0, 28.0]    (56.0, 67.0]             21                5
11   (28.0, 34.0]    (67.0, 84.0]             37               13

Given the DataFrame above, I want to perform this pivot table operation with pandas:

pd.pivot_table(df, index='cat1', columns='cat2', values='col_a')

But I get this error:

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

Both col_a and col_b are of type Int32, and cat1 and cat2 are categorical. How can I get rid of this error?

cs9*_*s95 5

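This is a bug related to pivoting on intervals (see GH25814) and will be fixed for v0.25. See also this related question involving crosstab: Pandas crosstab on CategoricalDType columns throws TypeError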

In the meantime, here are a few options. To aggregate, you will have to use pivot_table and convert the categorical columns to strings before pivoting.

df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
# to aggregate by taking the mean of col_a
df2.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean')

The caveat is that you lose the benefit of having the index and columns as intervals.
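To make that caveat concrete, here is a small sketch on made-up data (the `x`, `y` columns and the bin edges are illustrative assumptions, not from the question). After the string conversion the pivot labels are plain strings, whereas an IntervalIndex can still find the bin that contains a given value.

```python
import pandas as pd

# Illustrative data only; x and y are hypothetical raw columns to be binned.
df = pd.DataFrame({'x': [5, 15, 45], 'y': [12, 38, 38], 'col_a': [10, 20, 30]})
bins = [0, 10, 20, 30, 40, 50]
df['cat1'] = pd.cut(df['x'], bins=bins)
df['cat2'] = pd.cut(df['y'], bins=bins)

# Workaround from the answer: pivot on the string form of the intervals.
as_str = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
pivot = as_str.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

# The labels are now plain strings, not intervals.
print(isinstance(pivot.index, pd.IntervalIndex))  # False

# The original categories, by contrast, form an IntervalIndex that can
# locate the bin containing a value.
intervals = df['cat1'].cat.categories
position = intervals.get_loc(15)  # position of the bin that contains 15
print(intervals[position])
```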

Another option is to pivot on the categorical codes and then reassign the categories:

df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

pivot.index = df['cat1'].cat.categories
pivot.columns = df['cat2'].cat.categories

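This assignment works because pivot_table sorts its group keys, so the codes come out in category order. Note that it assumes every category actually occurs in the data and that there are no missing values; otherwise the number of labels will not match the pivot's shape.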
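If some categories might not occur in the data, one possible variation (a sketch, not part of the original answer) is to look up only the categories whose codes appear in the pivot, rather than assigning the full category list. The data below is made up for illustration, and the sketch still assumes there are no missing values (a code of -1 would be mapped to the wrong interval).

```python
import pandas as pd

# Illustrative data only; several bins are deliberately left empty.
df = pd.DataFrame({
    'x': [5, 15, 15, 45],
    'y': [12, 12, 38, 38],
    'col_a': [10, 20, 30, 40],
})
bins = [0, 10, 20, 30, 40, 50]
df['cat1'] = pd.cut(df['x'], bins=bins)
df['cat2'] = pd.cut(df['y'], bins=bins)

# Pivot on the integer category codes instead of the intervals.
codes = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = codes.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')

# Map each code that is present back to its interval.
# Assumption: no NaN in cat1/cat2, so no code equals -1.
pivot.index = (df['cat1'].cat.categories
               .take(pivot.index.to_numpy()).rename('cat1'))
pivot.columns = (df['cat2'].cat.categories
                 .take(pivot.columns.to_numpy()).rename('cat2'))

print(pivot)
```

Because the lookup uses the codes that actually appear, the label count always matches the pivot's shape, and the index and columns remain interval-typed.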


Minimal Code Sample

import pandas as pd
import numpy as np

np.random.seed(0)

df = pd.DataFrame({
    'cat1': np.random.choice(100, 10), 
    'cat2': np.random.choice(100, 10), 
    'col_a': np.random.randint(1, 50, 10)})

df['cat1'] = pd.cut(df['cat1'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['cat2'], bins=np.arange(0, 101, 10))

df
       cat1      cat2  col_a
0  (40, 50]  (60, 70]     18
1  (40, 50]  (80, 90]     38
2  (60, 70]  (80, 90]     26
3  (60, 70]  (10, 20]     14
4  (60, 70]  (50, 60]      9
5   (0, 10]  (60, 70]     10
6  (80, 90]  (30, 40]     21
7  (20, 30]  (80, 90]     17
8  (30, 40]  (40, 50]      6
9  (80, 90]  (80, 90]     16

(df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
   .pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean'))

cat2      (10, 20]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (80, 90]
cat1                                                                
(0, 10]        NaN       NaN       NaN       NaN      10.0       NaN
(20, 30]       NaN       NaN       NaN       NaN       NaN      17.0
(30, 40]       NaN       NaN       6.0       NaN       NaN       NaN
(40, 50]       NaN       NaN       NaN       NaN      18.0      38.0
(60, 70]      14.0       NaN       NaN       9.0       NaN      26.0
(80, 90]       NaN      21.0       NaN       NaN       NaN      16.0