How can I get a summary of a pandas DataFrame similar to the one in R?

Question

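Different scales of measurement allow different kinds of operations. I would like to specify the scale of each column in a DataFrame df, and df.describe() should then take this into account.

Examples

Nominal scale: a nominal scale only allows checking for equivalence. Examples are sex, name, and city name. You can basically only count how often values occur and report the most common one (the mode).
Ordinal scale: you can order the values, but you cannot say how far one is from another. Clothing sizes are an example. You can calculate the median/min/max for this scale.
Quantitative scale: you can calculate the mean, standard deviation, and quantiles for these scales.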
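
A small sketch of how these three scales could map onto pandas operations. The column names and values are made up for illustration, and the ordinal median is taken as the middle element after sorting, which is just one simple convention for an odd number of rows.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Rome", "Berlin"],  # nominal
    "size": ["M", "S", "L", "M", "XL"],                        # ordinal
    "weight": [70.5, 58.0, 82.3, 66.1, 95.0],                  # quantitative
})

# Nominal: unordered category, only counting and the mode make sense
df["city"] = df["city"].astype("category")
city_counts = df["city"].value_counts()
city_mode = df["city"].mode()[0]

# Ordinal: ordered category, so min/max and a positional median are defined
size_dtype = pd.CategoricalDtype(["S", "M", "L", "XL"], ordered=True)
df["size"] = df["size"].astype(size_dtype)
size_min = df["size"].min()
size_max = df["size"].max()
size_median = df["size"].sort_values().iloc[len(df) // 2]

# Quantitative: mean, standard deviation and quantiles are all meaningful
weight_mean = df["weight"].mean()
weight_std = df["weight"].std()
weight_quartiles = df["weight"].quantile([0.25, 0.5, 0.75])

print(city_counts, city_mode, size_min, size_median, size_max, weight_mean)
```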

Code example

import pandas as pd
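# Note: pandas.rpy was removed from later pandas releases; on a current
# install, load mtcars another way (e.g. the pydataset package used in the answer)
import pandas.rpy.common as rcom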
df = rcom.load_data('mtcars')
print(df.describe())

gives

             mpg        cyl        disp          hp       drat         wt  \
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000   
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250   
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000   

            qsec         vs         am       gear     carb  
count  32.000000  32.000000  32.000000  32.000000  32.0000  
mean   17.848750   0.437500   0.406250   3.687500   2.8125  
std     1.786943   0.504016   0.498991   0.737804   1.6152  
min    14.500000   0.000000   0.000000   3.000000   1.0000  
25%    16.892500   0.000000   0.000000   3.000000   2.0000  
50%    17.710000   0.000000   0.000000   4.000000   2.0000  
75%    18.900000   1.000000   1.000000   4.000000   4.0000  
max    22.900000   1.000000   1.000000   5.000000   8.0000

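This is not good, because vs is a binary variable that indicates whether the car has a V-engine or a straight engine (source). The feature is therefore on a nominal scale, so min/max/std/mean do not apply. Instead, one should count how often 0 and 1 occur.

In R, you can do the following: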

mtcars$vs = factor(mtcars$vs, levels=c(0, 1), labels=c("straight engine", "V-Engine"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
summary(mtcars)

to get

      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec                     vs             am     gear   carb  
 Min.   :1.513   Min.   :14.50   straight engine:18   Automatic:19   3:15   1: 7  
 1st Qu.:2.581   1st Qu.:16.89   V-Engine       :14   Manual   :13   4:12   2:10  
 Median :3.325   Median :17.71                                       5: 5   3: 3  
 Mean   :3.217   Mean   :17.85                                              4:10  
 3rd Qu.:3.610   3rd Qu.:18.90                                              6: 1  
 Max.   :5.424   Max.   :22.90                                              8: 1

Is there something similar in pandas?

I tried

df["vs"] = df["vs"].astype('category')

But this makes "vs" disappear from the output of describe().
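
For context on why "vs" vanishes: by default, DataFrame.describe() summarizes only the numeric columns when the frame contains any, so a column converted to the category dtype is skipped unless the include argument is passed. A minimal sketch on a small made-up frame (the values are illustrative, not the real mtcars data):

```python
import pandas as pd

# Toy stand-in for two mtcars columns (values are illustrative only)
df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "vs": [0, 1, 0, 0],
})
df["vs"] = df["vs"].astype("category")

# Default: only numeric columns are described, so "vs" is missing
default_summary = df.describe()

# include="all" describes every column; categorical columns get
# count/unique/top/freq, and statistics that do not apply are NaN
full_summary = df.describe(include="all")

# Alternatively, describe only the categorical columns
cat_summary = df.describe(include=["category"])

print(full_summary)
```

This still does not give per-level counts the way R's summary() does, but it at least keeps the categorical column in the report.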

Answer 1

ves*_*and 2

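Late to the party, but I happen to have been wrestling with some of the same issues recently, so I'd like to share my take on this challenge.

In my opinion, R is still better at handling categorical variables. However, there are a few ways to mimic some of that functionality in Python using pd.Categorical(), pd.get_dummies() and describe().

The challenge in this particular dataset is that the categorical variables have very different properties. For example, am is 0 or 1 for automatic or manual gears, respectively. And gear is either 3, 4, or 5, but is still most reasonably treated as a categorical rather than a numerical value. So for am I would replace 0 and 1 with 'automatic' and 'manual', but for gear I would apply pd.get_dummies() to get a 0 or 1 for each gear category, which makes it easy to count how many models have, for example, 3 gears.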
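
The two treatments described above can be sketched roughly as follows. The tiny frame and its values are made up for illustration; only the column names am and gear come from mtcars.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "am": [1, 1, 0, 0, 1],
    "gear": [4, 4, 3, 3, 5],
})

# Binary variable: replace the 0/1 codes with readable labels, then count
am_labels = df["am"].map({0: "automatic", 1: "manual"}).astype("category")
am_counts = am_labels.value_counts()

# Multi-level variable: one indicator column per level; summing each
# indicator column gives the number of rows at that level
gear_dummies = pd.get_dummies(df["gear"], prefix="gear")
gear_counts = gear_dummies.sum()

print(am_counts)
print(gear_counts)
```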

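I have a utility function that has been lying around for a while, and I improved it a bit yesterday. It is certainly not the most elegant, but it should give you the same information as the R snippet. The final output table would consist of columns with unequal numbers of rows. Rather than building such a table as a DataFrame and padding it with NaN, I split the information in two: one table for the numerical values and one for the categorical values, so you end up with this: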

                 count
Straight Engine     18
V engine            14
automatic           19
manual              13
cyl_4               11
cyl_6                7
cyl_8               14
gear_3              15
gear_4              12
gear_5               5
carb_1               7
carb_2              10
carb_3               3
carb_4              10
carb_6               1
carb_8               1
             mpg        disp          hp       drat         wt       qsec
count  32.000000   32.000000   32.000000  32.000000  32.000000  32.000000
mean   20.090625  230.721875  146.687500   3.596563   3.217250  17.848750
std     6.026948  123.938694   68.562868   0.534679   0.978457   1.786943
min    10.400000   71.100000   52.000000   2.760000   1.513000  14.500000
25%    15.425000  120.825000   96.500000   3.080000   2.581250  16.892500
50%    19.200000  196.300000  123.000000   3.695000   3.325000  17.710000
75%    22.800000  326.000000  180.000000   3.920000   3.610000  18.900000
max    33.900000  472.000000  335.000000   4.930000   5.424000  22.900000

Here is the whole thing for easy copy and paste:

# imports
import pandas as pd

# to easily access R datasets:
# pip install pydataset
from pydataset import data 

# Load dataset
df_mtcars = data('mtcars')


# The following variables: cat, dum, num and recoding
# are used in the function describeCat(df, dummies, recode, categorical) below

# Specify which variables are dummy variables [0 or 1],
# categorical [multiple categories] or numeric
cat = ['cyl', 'gear', 'carb']
dum = ['vs', 'am']
num = [c for c in list(df_mtcars) if c not in cat+dum]

# Also, define a dictionary that describes how some dummy variables should be recoded
# (the first label replaces 0, the second label replaces 1).
# For example, in the series am, 0 is recoded as automatic and 1 as manual gears
recoding = {'am':['automatic', 'manual'], 'vs':['Straight Engine', 'V engine']}

# The function:
def describeCat(df, dummies, recode, categorical):
    """ Retrieves specified dummy and categorical variables
        from a pandas DataFrame and describes them (just count for now).

        Dummy variables [0 or 1] can be recoded to categorical variables
        by specifying a dictionary

    Keyword arguments:
    df -- pandas DataFrame
    dummies -- list of column names to specify dummy variables [0 or 1]
    recode -- dictionary to specify which and how dummy variables should be recoded
    categorical -- list of column names to specify categorical variables

    """

    # Recode dummy variables and count each label
    recoded_counts = []
    for dummy in dummies:
        if dummy in recode:
            # 0 becomes recode[dummy][0], 1 becomes recode[dummy][1]
            labels = df[dummy].map({0: recode[dummy][0], 1: recode[dummy][1]})
            recoded_counts.append(labels.value_counts().sort_index())

    # Since categorical variables will be transformed into dummy variables,
    # all remaining dummy variables (after recoding) can be treated the
    # same way as the categorical variables.
    # Use the function arguments here, not the module-level globals.
    unrecoded = [var for var in dummies if var not in recode]
    categorical = categorical + unrecoded

    # apply pd.get_dummies on all categorical variables; summing each
    # indicator column gives the count per category
    cat_counts = []
    for col in categorical:
        newCats = pd.get_dummies(pd.Categorical(df[col]), prefix = col)
        cat_counts.append(newCats.sum())

    # gather all counts into a single-column output dataframe
    df_output = pd.concat(recoded_counts + cat_counts).astype(int).to_frame('count')

    return df_output

# Test run: Build a dataframe that describes the dummy and categorical variables
df_categorical = describeCat(df = df_mtcars, dummies = dum, recode = recoding, categorical = cat)

# describe numerical variables
df_numerical = df_mtcars[num].describe()

print(df_categorical)
print(df_numerical)
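
As a more compact variation on the same idea (not the function above, just a sketch), the columns can be converted to the category dtype and the frame split with select_dtypes. The helper name summarize_by_dtype and the toy data are hypothetical.

```python
import pandas as pd

def summarize_by_dtype(df):
    """Return (numeric summary, count per category level) for a DataFrame."""
    # Numeric columns: the usual describe() table
    numeric = df.select_dtypes(include="number").describe()

    # Categorical columns: one row per "<column>_<level>" with its count
    cat_cols = df.select_dtypes(include="category").columns
    rows = {
        f"{col}_{level}": n
        for col in cat_cols
        for level, n in df[col].value_counts(sort=False).items()
    }
    counts = pd.Series(rows, name="count").to_frame()
    return numeric, counts

# Toy data (hypothetical values, not the real mtcars rows)
toy = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "gear": [4, 4, 3, 5],
    "am": ["manual", "manual", "automatic", "automatic"],
})
toy = toy.astype({"gear": "category", "am": "category"})

numeric, counts = summarize_by_dtype(toy)
print(numeric)
print(counts)
```

The design choice here is to let the dtype carry the scale information, so the caller declares once which columns are categorical and the summary follows from that.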

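A side note on categorical variables and describe():

The reason I use pd.Categorical() in the function above is that the output of describe() seems to be somewhat unstable. Sometimes df_mtcars['gear'].astype('category').describe() returns: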

count    32.000000
mean      3.687500
std       0.737804
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       5.000000
Name: gear, dtype: float64

whereas, given that it is treated as a categorical variable, it should return:

count     32
unique     3
top        3
freq      15
Name: gear, dtype: int64

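I may be wrong here, and I am having trouble reproducing the issue, but I could swear this happens from time to time.

Using describe() on a pd.Categorical() gives output in a format of its own, but at least it seems to be stable.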

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

Some final words on pd.get_dummies()

Here is what happens when you apply that function to df_mtcars['gear']:

# code
pd.get_dummies(df_mtcars['gear'].astype('category'), prefix = 'gear')

# output
                   gear_3  gear_4  gear_5
Mazda RX4               0       1       0
Mazda RX4 Wag           0       1       0
Datsun 710              0       1       0
Hornet 4 Drive          1       0       0
Hornet Sportabout       1       0       0
Valiant                 1       0       0
...
Ferrari Dino            0       0       1
Maserati Bora           0       0       1
Volvo 142E              0       1       0

But in this case, I would simply use value_counts() so that you get the following:

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

This also happens to resemble the output of using describe() on a pd.Categorical() variable.
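
A table with both counts and relative frequencies can be assembled from value_counts() along these lines. This is a sketch on a made-up gear column (not the real mtcars values); the column labels counts, freqs and categories are chosen only to mirror the output shown above.

```python
import pandas as pd

# Toy gear column (hypothetical values)
gear = pd.Series([4, 4, 3, 3, 5, 3], name="gear")

# Absolute counts per level, ordered by level rather than by frequency
counts = gear.value_counts().sort_index()

# Relative frequencies per level (normalize=True divides by the total)
freqs = gear.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({"counts": counts, "freqs": freqs})
summary.index.name = "categories"
print(summary)
```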
