Pandas replace/dictionary slowness

Question


Att*_*nen 6 python performance dictionary pandas

Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:

# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)


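Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it isn't vectorized, iterating over 200 rows is only 200 iterations, so how can it be slow?

Here is an SSCCE demonstrating the issue: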

import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')


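Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1000 operations.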
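For reference, the expectation behind the question (one O(1) dictionary lookup per row) can be sketched in plain Python with no pandas involved. This is only an illustrative baseline, not part of the original post; the fixed seed and the name `replaced` are assumptions added here.

```python
import random

random.seed(0)  # fixed seed, added only so the sketch is repeatable

# Same dummy data as in the question, built with comprehensions.
dictionary = {x: 'Some string ' + str(x) for x in range(11270)}
orig = [random.randint(1, 11269) for _ in range(200)]

# One hash lookup per row; a value missing from the dictionary is kept as-is.
replaced = [dictionary.get(x, x) for x in orig]
print(replaced[:3])
```

This does 200 lookups and nothing else, which is the amount of work the question assumes `replace` should need.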

Answer 1

roo*_*oot 14

It looks like `replace` has a bit of overhead, and explicitly telling the Series what to do via `map` yields the best performance:

series = series.map(lambda x: dictionary.get(x,x))


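If you know for sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and passing the `dictionary.get` function directly. Any keys that are not present come back as missing values (`None`/`NaN`) with this method, so beware: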

series = series.map(dictionary.get)

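The difference in missing-key behavior can be seen on a toy example. The mapping `small_dict` and the value `99` are hypothetical, chosen only so that one row has no matching key.

```python
import pandas as pd

small_dict = {1: 'one', 2: 'two'}  # hypothetical toy mapping
series = pd.Series([1, 2, 99])     # 99 is deliberately not a key

# The lambda falls back to the original value when the key is absent.
with_fallback = series.map(lambda x: small_dict.get(x, x))

# dict.get alone returns None for an absent key, so that row ends up missing.
without_fallback = series.map(small_dict.get)

print(with_fallback.tolist())
print(without_fallback.isna().tolist())
```

If unmatched rows must keep their original value, the lambda form is the safer choice of the two.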

You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:

series = series.map(dictionary)

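Before swapping one variant for another, it may be worth checking that they agree on your data. The sketch below does this on a small hypothetical mapping in which every value has a key; `small_dict`, `results`, and `expected` are names introduced here for illustration.

```python
import pandas as pd

small_dict = {1: 'one', 2: 'two', 3: 'three'}  # hypothetical toy mapping
series = pd.Series([3, 1, 2, 1])               # every value has a key

# The three map variants from this answer, plus the original replace call.
results = {
    'map(dictionary.get)': series.map(small_dict.get),
    'map(lambda)': series.map(lambda x: small_dict.get(x, x)),
    'map(dictionary)': series.map(small_dict),
    'replace(dictionary)': series.replace(small_dict),
}

expected = ['three', 'one', 'two', 'one']
for name, result in results.items():
    print(name, result.tolist() == expected)
```

The variants only diverge when a value has no key, so this check says nothing about that case.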

Timings

Some timing comparisons using your example data:

%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop

%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop

%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop

%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop

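The `%timeit` lines above need IPython. A rough equivalent using the standard `timeit` module might look like the sketch below; the seed, the `candidates` and `timings` names, and the best-of-3 single-run scheme are assumptions made here, and the absolute numbers will depend on the pandas version and the machine.

```python
import random
import timeit

import pandas as pd

random.seed(0)  # fixed seed, added only so the data is repeatable

# Same shape of data as in the question: 11270 keys, 200 rows.
dictionary = {x: 'Some string ' + str(x) for x in range(11270)}
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

candidates = {
    'map(dictionary.get)': lambda: series.map(dictionary.get),
    'map(lambda with fallback)': lambda: series.map(lambda x: dictionary.get(x, x)),
    'map(dictionary)': lambda: series.map(dictionary),
    'replace(dictionary)': lambda: series.replace(dictionary),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 single runs; coarser than %timeit but enough to compare magnitudes.
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{name:<28} {timings[name] * 1e3:10.3f} ms')
```

Single runs are noisy for the microsecond-scale variants, so raise `number` if you need to tell the two `dictionary.get` forms apart.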

`.replace` can do partial substring matching (with `regex=True`), whereas `.map` requires the complete value to be present in the dictionary (2 upvotes)

Archived:	8 years, 11 months ago
Views:	2183
Last activity:	8 years, 11 months ago