Pandas stack and unstack become very slow after compressing the DataFrame, and much worse than R's data.table

wat*_*wer 7 python r pandas data.table

This question is about improving the performance of pandas stack and unstack operations.

The problem: I have a large DataFrame (~2GB). Following this blog post, I successfully compressed it down to ~150MB. However, my stack and unstack operations then take so long that I have to kill the kernel and restart everything.

I also tried R's data.table package, and it simply flies, meaning it finishes the operation in <1 second.

I researched this on SO. It seems some people have pointed to the "map-reduce dataframe unstack performance - pandas" thread, but I don't think it applies here, for two reasons:

  1. stack and unstack run fine on the uncompressed pandas DataFrame; it is only on the original (full-size) dataset that memory issues prevent me from running them.
  2. R's data.table converts from long to wide format with ease (<1 second).

For SO reproduction purposes, I managed to cut the data down to a small sample (5MB). The sample has been uploaded to http://www.filedropper.com/ddataredact. This file should reproduce the problem.

Here is my pandas code:

import string
import random
import numpy as np
import pandas as pd

#Added code to generate test data
data = {'ANDroid_Margin':{'type':'float','len':13347},
        'Name':{'type':'cat','len':71869},
        'Geo1':{'type':'cat','len':4},
        'Geo2':{'type':'cat','len':31},
        'Model':{'type':'cat','len':2}}

ddata_i = pd.DataFrame()
len_data = 114348
#Generate random replacement values for each column
for colk,colv in data.items():
    print("Processing column:",colk)
    #Is this column categorical?
    if data[colk]['type']=='cat':
        chars = string.digits + string.ascii_lowercase
        replacement_value = [
            "".join(
                [random.choice(chars) for i in range(5)]
            ) for j in range(data[colk]['len'])]

    else:
        replacement_value = np.random.uniform(
            low=0.0, high=20.0, size=(data[colk]['len'],))
    ddata_i[colk] = np.random.choice(
        replacement_value,size=len_data,replace = True)

#Unstack and Stack now. This will show the result quickly
ddata_i.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()

#Compress our data
ddata = ddata_i.copy()

#Convert object columns to category dtype
df_obj = ddata.select_dtypes(include=['object']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('category')
ddata[df_obj.columns] = df_obj

#Re-cast float columns (note: astype('float') keeps float64,
#so this step is effectively a no-op)
df_obj = ddata.select_dtypes(include=['float']).copy()
for col in df_obj:
    df_obj.loc[:, col] = df_obj[col].astype('float')
ddata[df_obj.columns] = df_obj

#Let's quickly check whether compressed file is same as original file
assert ddata.shape==ddata_i.shape, "Output seems wrong"
assert ddata_i.ANDroid_Margin.sum()==ddata.ANDroid_Margin.sum(),"Sum isn't right"
for col in ["ANDroid_Margin","Name","Geo1","Geo2"]:
    assert sorted(list(ddata_i[col].unique()))==sorted(list(ddata[col].unique()))

#This will run forever
ddata.groupby(["Name","Geo1","Geo2","Model"]).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
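As an aside, the memory savings from the category conversion can be checked directly; a minimal sketch (assuming ddata_i and ddata from the code above are in scope):

#Compare memory footprints of the original and compressed frames
mb = lambda df: df.memory_usage(deep=True).sum() / 1024**2
print("original:   %.2f MB" % mb(ddata_i))
print("compressed: %.2f MB" % mb(ddata))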

You will notice that the stack and unstack operations run quickly on ddata_i, but never finish on the compressed ddata. Why is this?

Also, I noticed that if I compress only the object columns or only the float columns, then stack() and unstack() run quickly. The problem only appears when I do both.

Can someone help me understand what I am missing? How can I fix the pandas problem above? With performance issues this severe, how can I write production-ready pandas code? I would appreciate your thoughts.


Finally, here is the R data.table code. I have to say that data.table is not only fast, but with it I never had to go through the compress/uncompress dance in the first place.

df <- data.table::fread("ddata_redact.csv",
                        stringsAsFactors=FALSE,
                        data.table = TRUE, 
                        header = TRUE)

df1=data.table::dcast(df, Name + Geo1 + Geo2 ~ Model, 
                      value.var = "ANDroid_Margin",
                      fun.aggregate = sum)
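For comparison, a rough pandas analogue of this dcast call would be pivot_table; a sketch, assuming the same column names as in the R code:

#Wide table of summed ANDroid_Margin by Model, as in the dcast above
df = pd.read_csv("ddata_redact.csv")
df1 = df.pivot_table(index=["Name", "Geo1", "Geo2"],
                     columns="Model",
                     values="ANDroid_Margin",
                     aggfunc="sum").reset_index()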



Python system info:

sys.version_info
> sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)

pandas version:

pd.__version__
> '0.23.4'

data.table version:

1.11.8

wat*_*wer 4

I found the answer. The problem is that we need to add observed=True to stop pandas from computing the Cartesian product of the categorical levels.

After compressing, I have to run this...

ddata.groupby(["Name","Geo1","Geo2","Model"], observed=True).\
    sum().\
    unstack().\
    stack(dropna=False).\
    reset_index()
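
To see why this matters: when the group keys are categorical, groupby by default (observed=False in this pandas version) materializes every combination of category levels, even those that never occur together, which blows up with a high-cardinality column like Name (~72k levels). A toy illustration (made-up data, not the file above):

import pandas as pd

#Two categorical keys; only 2 of the 4 level combinations occur
df = pd.DataFrame({"a": pd.Categorical(["x", "y"]),
                   "b": pd.Categorical(["p", "q"]),
                   "v": [1.0, 2.0]})

#The default emits all 4 combinations of levels...
print(len(df.groupby(["a", "b"]).sum()))                 # 4
#...while observed=True keeps only the 2 that actually appear
print(len(df.groupby(["a", "b"], observed=True).sum()))  # 2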