I'm relatively new to the Python world and am trying to use it as a backup platform for data analysis. I usually use data.table for my data-analysis needs.
The problem is that when I run a group-and-aggregate operation on a large CSV file (randomized, compressed, and uploaded to http://www.filedropper.com/ddataredact_1), Python throws:
return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed
or (I have even encountered...)
文件"C:\ Anaconda3\lib\site-packages\pandas\core\reshape\util.py",第65行,在cartesian_product中为i,x在枚举(X)中文件"C:\ Anaconda3\lib\site- packages\pandas\core\reshape\util.py",第65行,in为i,x为枚举(X)]文件"C:\ Anaconda3\lib\site-packages \numpy\core\fromnumeric.py",line 445,重复返回_wrapfunc(a,'repeat',重复,轴=轴)文件"C:\ Anaconda3\lib\site-packages \numpy\core\fromnumeric.py",第51行,在_wrapfunc中返回getattr(obj ,方法)(*args,**kwds)MemoryError
I've spent three days trying to reduce the file size (I was able to shrink it by 89%), adding breakpoints, and debugging it, but I haven't been able to make any progress.
Surprisingly, I tried running the same group/aggregate operation with data.table in R, and it barely took 1 second. Moreover, I didn't have to do any of the data-type conversions suggested in https://www.dataquest.io/blog/pandas-big-data/.
I also looked into other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame, Pandas: df.groupby() is too slow for big data set. Any alternatives methods?, and pandas groupby with sum() on large csv file?. It seems those threads are more about matrix multiplication. I would appreciate it if you didn't mark this as a duplicate.
Here is my Python code:
import pandas as pd

finaldatapath = r"..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv",
                    low_memory=False, encoding="ISO-8859-1")
#before optimization: 353MB
ddata.info(memory_usage="deep")
#optimize file: Object-types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()
#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')
#release memory
del ddata_obj
#after optimization: 39MB
ddata.info(memory_usage="deep")
#Create a list of grouping variables:
group_column_list = [
"Business",
"Device_Family",
"Geo",
"Segment",
"Cust_Name",
"GID",
"Device ID",
"Seller",
"C9Phone_Margins_Flag",
"C9Phone_Cust_Y_N",
"ANDroid_Lic_Type",
"Type",
"Term",
'Cust_ANDroid_Margin_Bucket',
'Cust_Mobile_Margin_Bucket',
# # 'Cust_Android_App_Bucket',
'ANDroind_App_Cust_Y_N'
]
print("Analyzing data now...")
def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)
ddata = ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The code crashes in the .groupby operation above.
Can someone please help me? Compared to my other posts, I have probably spent the most time on this StackOverflow post, trying to fix it and learning new things about Python along the way. However, I have reached a saturation point: it frustrates me all the more because R's data.table package processes this file in <2 seconds. This post is not about the pros and cons of R versus Python, but about using Python more efficiently.
I am completely lost, and I would appreciate any help.
Here is my data.table R code:
path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE, data.table = TRUE, header = TRUE)
group_column_list <-c(
"Business",
"Device_Family",
"Geo",
"Segment",
"Cust_Name",
"GID",
"Device ID",
"Seller",
"C9Phone_Margins_Flag",
"C9Phone_Cust_Y_N",
"ANDroid_Lic_Type",
"Type",
"Term",
'Cust_ANDroid_Margin_Bucket',
'Cust_Mobile_Margin_Bucket',
# # 'Cust_Android_App_Bucket',
'ANDroind_App_Cust_Y_N'
)
ddata<-ddata[, .(ANDroid_Margin = sum(ANDroid_Margin,na.rm = TRUE),
Margins=sum(Margins,na.rm = TRUE),
Apple_Margin=sum(Apple_Margin,na.rm=TRUE),
Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins,na.rm = TRUE),
Cust_Mobile_Margins = mean(Cust_Mobile_Margins,na.rm = TRUE),
Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty,na.rm = TRUE),
ANDroid_App_Qty=sum(ANDroid_App_Qty,na.rm = TRUE)
),
by=group_column_list]
Adding to Josemz's comment, below are two threads on agg vs. apply: What is the difference between pandas agg and apply functions? and Difference between apply() and aggregate() functions in Pandas. A quick sketch of the difference follows.
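To make that difference concrete, here is a minimal sketch on a made-up toy frame (the data and variable names are hypothetical, not from the question's CSV): apply hands each group to an arbitrary Python function, while agg takes a column-to-function mapping that pandas can dispatch to optimized built-ins.

import pandas as pd

# Hypothetical toy frame, just to contrast the two calls:
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "x": [1.0, 2.0, 3.0],
                   "y": [10.0, 20.0, 30.0]})

# apply: the function sees each group as a DataFrame and runs once per
# group in plain Python; flexible, but slow when there are many groups.
by_apply = df.groupby("key").apply(
    lambda g: pd.Series({"x_sum": g["x"].sum(), "y_mean": g["y"].mean()}))

# agg: a column -> function mapping; pandas dispatches to fast built-ins.
by_agg = df.groupby("key").agg({"x": "sum", "y": "mean"})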
I think what you're looking for is agg instead of apply. You can pass a dictionary mapping columns to the functions you want applied to them, so I think this will work for you:
ddata = ddata.groupby(group_column_list).agg({
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean'})
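One caveat worth adding, as an assumption drawn from the traceback rather than something verified against the original file: the question converts every object column to category dtype and then groups on sixteen of those columns. With categorical keys, groupby defaults to observed=False and materializes the full cartesian product of all category levels, which matches the cartesian_product call in the MemoryError traceback and can exhaust memory on its own. Assuming a pandas version that supports the observed keyword (0.23+), passing observed=True keeps only the combinations that actually occur in the data:

# observed=True (assumes pandas >= 0.23) restricts the result to category
# combinations that actually appear, instead of the full cross product.
ddata = ddata.groupby(group_column_list, observed=True).agg({
    'ANDroid_Margin'      : 'sum',
    'Margins'             : 'sum',
    'ANDroid_App_Qty'     : 'sum',
    'Apple_Margin'        : 'sum',
    'P_Lic'               : 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins' : 'mean',
    'Cust_ANDroid_App_Qty': 'mean'})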