Why is np.where faster than pd.apply?

Vik*_*ngh 7 python numpy pandas

Here is the sample code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer' : ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending' : [130,22,313,46]})

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)

In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop

In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop

The question comes from: https://stackoverflow.com/a/41166160/3027854

jez*_*ael 10

I think np.where is faster because it operates on the underlying numpy array in a vectorized way, and pandas is built on top of these arrays.

df.apply is slow because it uses Python-level loops: with axis=1 the lambda is called once per row.
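To make the looping concrete, here is a minimal sketch of roughly what a row-wise apply amounts to, using the four-row frame from the question. The helper name grade_row is made up for illustration, and the explicit loop only approximates the idea; it is not how pandas implements apply internally.

```python
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

def grade_row(row):
    # Hypothetical helper: the same logic as the lambda in the question.
    return 'A' if row['Spending'] > 100 else 'B'

# Explicit Python loop: one Series object and one function call per row.
looped = pd.Series([grade_row(row) for _, row in df.iterrows()],
                   index=df.index)

# Row-wise apply: the function is likewise called once per row.
applied = df.apply(grade_row, axis=1)
```

Both versions pay the cost of a Python function call and a row object for every one of the rows, which is why the cost grows so quickly on the 400000-row frame.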

Vectorized operations are the fastest, then cython routines, and then apply.
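A small sketch for checking this ordering on your own machine, assuming a smaller frame than the one in the question so that it finishes quickly. The function names vectorized and row_wise are illustrative, and the measured times will vary by hardware and library version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

# 4000 rows: same Spending values as the question, repeated.
df = pd.DataFrame({'Spending': [130, 22, 313, 46] * 1000})

def vectorized():
    # One comparison and one selection over the whole array at once.
    return np.where(df['Spending'].to_numpy() > 100, 'A', 'B')

def row_wise():
    # The lambda runs once per row in Python.
    return df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B',
                    axis=1)

t_vec = timeit.timeit(vectorized, number=3)
t_row = timeit.timeit(row_wise, number=3)
print(f"np.where: {t_vec:.4f}s  apply: {t_row:.4f}s")
```

Both functions produce the same grades, so the comparison isolates the cost of how the work is dispatched rather than what is computed.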

See this answer for a better explanation from Jeff, a pandas developer.