
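Merging dataframes on the index is more efficient in Pandas

Why is merging dataframes on the index more efficient (faster) than merging on a column in Pandas?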

import pandas as pd

# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})

df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15], 
                    'Age': [33, 41, 42, 50, 45, 28, 32]})


df = df.set_index('ID')
df2 = df2.set_index('ID')


This represents roughly a 3.5x speedup! (Using Pandas 0.23.0)
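A minimal sketch of how such a comparison could be timed, assuming the two operations being compared are a column merge on `ID` and an index merge after `set_index('ID')`. The helper names `merge_on_column` and `merge_on_index` and the repeat count are made up for illustration, and the ratio you see will depend on data size, pandas version and machine, so no expected timings are shown.

```python
import timeit

import pandas as pd

# Same toy frames as in the question, kept with ID as a regular column.
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15],
                    'Age': [33, 41, 42, 50, 45, 28, 32]})

# Indexed copies; set_index is done once, outside the timed code.
df_idx = df.set_index('ID')
df2_idx = df2.set_index('ID')


def merge_on_column():
    # Hypothetical helper: inner merge on the shared ID column.
    return pd.merge(df, df2, on='ID')


def merge_on_index():
    # Hypothetical helper: inner merge on the ID index of both frames.
    return pd.merge(df_idx, df2_idx, left_index=True, right_index=True)


n = 200  # arbitrary repeat count, chosen only to smooth out noise
t_col = timeit.timeit(merge_on_column, number=n)
t_idx = timeit.timeit(merge_on_index, number=n)
print(f"column merge: {t_col / n * 1e3:.3f} ms per call")
print(f"index merge:  {t_idx / n * 1e3:.3f} ms per call")
```

Note that the cost of `set_index` is excluded here. If the index has to be rebuilt before every merge, that cost belongs in the comparison as well.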

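Reading through the Pandas internals page, it says that an Index "populates a dict of label to location in Cython to do O(1) lookups." Does this mean that operations using the index are more efficient than operations using columns? Is it best practice to always use the index for operations such as merges?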
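The label-to-location lookup that the internals page describes can be observed directly with `Index.get_loc`. This is only a small sketch of that lookup on the toy data from the question. It does not show which code path a merge takes internally, so it should not be read as an explanation of the speedup.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
df = df.set_index('ID')

# get_loc maps a label to its integer position in the index.
# For a unique index it returns a single integer.
pos = df.index.get_loc(3)
print(pos, df.iloc[pos]['Job'])

# Uniqueness and sortedness of the index are cheap to inspect and are
# worth checking before assuming that index-based operations will be fast.
print(df.index.is_unique, df.index.is_monotonic_increasing)
```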

I read the documentation on joining and merging, and it does not explicitly mention any benefit of using the index.

python merge dataframe pandas

wil*_*llk

lucky-day

12
votes

1
answer

1843
views

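Improving Pandas merge performance

I don't have a performance problem with Pandas merge specifically, as other posts suggest, but I have a class with many methods that perform a lot of merges on datasets.

The class has around 10 groupby calls and around 15 merges. While groupby is pretty fast, out of a total execution time of 1.5 seconds for the class, around 0.7 seconds is spent in those 15 merge calls.

I want to speed up those merge calls. Since I will run around 4000 iterations, saving a total of 0.5 seconds per iteration would cut the overall run time by roughly 30 minutes (4000 × 0.5 s = 2000 s, about 33 minutes), which would be great.

Any suggestions on what I should try? I have tried Cython and Numba; Numba was slower.

Thanks

Edit 1: Adding sample code snippets. My merge statements: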

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')
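The chain above repeats the same call five times, so it can be folded with `functools.reduce`. The sketch below is behaviorally the same left-merge chain written more compactly, not a speedup. The frames `data`, `t1`, `t2`, `t3` and their columns are toy stand-ins invented for illustration, since the real tables are not shown.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for self.data and the lookup tables (made-up values).
data = pd.DataFrame({'APPT_NBR': [1, 2, 3], 'base': ['a', 'b', 'c']})
t1 = pd.DataFrame({'APPT_NBR': [1, 2], 'x1': [10, 20]})
t2 = pd.DataFrame({'APPT_NBR': [2, 3], 'x2': [200, 300]})
t3 = pd.DataFrame({'APPT_NBR': [3], 'x3': [3000]})

tables = [t1, t2, t3]

# Left-merge each table onto the running result, starting from data.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='APPT_NBR', how='left'),
    tables,
    data,
)
print(merged)
```

This also removes the intermediate `tmp` variable, which in the original only aliases `tmpDf` and does not copy anything.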


And, after switching to joins, I ended up with the following statements:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left') …
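`DataFrame.join` also accepts a list of frames when joining on the index, so the chained joins can be passed in one call. The sketch below assumes every table has a unique `APPT_NBR` index and no overlapping column names. The frames are toy stand-ins invented for illustration, and whether this is faster than chaining should be measured on the real data.

```python
import pandas as pd

# Toy stand-ins for self.data and the lookup tables (made-up values).
data = pd.DataFrame({'APPT_NBR': [1, 2, 3], 'base': ['a', 'b', 'c']})
t1 = pd.DataFrame({'APPT_NBR': [1, 2], 'x1': [10, 20]})
t2 = pd.DataFrame({'APPT_NBR': [2, 3], 'x2': [200, 300]})
t3 = pd.DataFrame({'APPT_NBR': [3], 'x3': [3000]})

# set_index without inplace=True returns new frames and leaves the
# originals untouched, so the code can be re-run on every iteration.
dat = data.set_index('APPT_NBR')
indexed = [t.set_index('APPT_NBR') for t in (t1, t2, t3)]

# One call instead of a chain of .join(...) calls.
joined = dat.join(indexed, how='left')
print(joined)
```

If the lookup tables do not change between iterations, building `indexed` once outside the loop avoids repeating the `set_index` work each time.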


python merge cython pandas numba

Deb*_*har

2016-11-30

10
votes

3
answers

20k
views