Why is pandas.Series.map so shockingly slow?

Question

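Some days I just hate using middleware. Take this for example: I want a lookup table that maps values from a set of input (domain) values to output (range) values. The mapping is unique. A Python dict can do this, but since I expect the mapping to be quite large, why not use a pd.Series and its index, which has the added benefits that:

I can pass in multiple values to be mapped as a Series (hopefully faster than dictionary lookups)
the original Series' index is preserved in the result

Like so: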

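import numpy as np
import pandas as pd

# allrangevals, alldomainvals, domainvals and someindex stand for the real data
domain2range = pd.Series(allrangevals, index=alldomainvals)
# Apply the map
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)
assert result.index.equals(query_vals.index)  # Nice: the original index is kept
assert np.isin(result.values, allrangevals).all()  # Nice: every result is a range value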

Works fine. Except it doesn't. The time cost of the .map above grows with len(domain2range) rather than (more sensibly) with O(len(query_vals)), as can be shown:

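import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range_vals = domain + 10  # renamed from `range` to avoid shadowing the builtin
    maptable = pd.Series(range_vals, index=domain).sort_index()

    query_vals = pd.Series([1, 2, 3])

    def f():
        query_vals.map(maptable)

    # Average time per call, in seconds, over numiter calls
    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)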


10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453

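Facepalm. At n = 10000000 it takes about 0.016 / 3 ≈ 0.005 seconds per mapped value.

So, questions:

Is Series.map expected to behave like this? Why is it so utterly, ridiculously slow? I think I am using it as shown in the docs.
Is there a fast way to do table lookups with pandas? It seems the approach above is not it?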

Answer 1

use*_*956 3

https://github.com/pandas-dev/pandas/issues/21278

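Warm-up is the issue. (Double facepalm.) On first use, pandas silently builds and caches a hash index, which costs O(maplen). Calling the function under test once before timing it, so that the index is already built, gives much better performance.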
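To see the warm-up effect in isolation, here is a minimal sketch that times the first and the second `.map()` call against a freshly built lookup Series. The helper name `time_first_and_warm_map` is made up for this illustration, and the absolute numbers depend on the machine and the pandas version; the expectation, based on the explanation above, is that only the first call gets noticeably slower as n grows.

```python
import time

import numpy as np
import pandas as pd


def time_first_and_warm_map(n):
    """Time the first and the second .map() call against a fresh n-entry lookup Series."""
    domain = np.arange(n)
    maptable = pd.Series(domain + 10, index=domain)  # fresh index, not used for lookups yet
    query_vals = pd.Series([1, 2, 3])

    start = time.perf_counter()
    first_result = query_vals.map(maptable)  # first lookup against maptable's index
    first_call = time.perf_counter() - start

    start = time.perf_counter()
    query_vals.map(maptable)  # same lookup again on the same maptable
    warm_call = time.perf_counter() - start

    return first_call, warm_call, first_result


if __name__ == "__main__":
    for n in [10, 1_000_000]:
        first_call, warm_call, _ = time_first_and_warm_map(n)
        print(f"n={n}: first call {first_call:.6f} s, second call {warm_call:.6f} s")
```

Averaging over 100 iterations, as the benchmark in the question does, spreads that one-off cost over all calls, which would explain why the per-call time appeared to grow with the table size.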

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 100000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range_vals = domain + 10  # renamed from `range` to avoid shadowing the builtin
    maptable = pd.Series(range_vals, index=domain)  # .sort_index()

    query_vals = pd.Series([1, 2, 3])

    def f1():
        query_vals.map(maptable)
    f1()  # warm-up call: builds the cached index before timing
    print("Pandas1 ", n, timeit.timeit(stmt=f1, number=numiter) / numiter)

    def f2():
        query_vals.map(maptable.get)
    f2()  # warm-up call
    print("Pandas2 ", n, timeit.timeit(stmt=f2, number=numiter) / numiter)

    # Plain Python dict built once from the same table
    maptabledict = maptable.to_dict()
    query_vals_list = pd.Series([1, 2, 3]).tolist()

    def f3():
        {k: maptabledict[k] for k in query_vals_list}
    f3()  # warm-up call
    print("Py dict ", n, timeit.timeit(stmt=f3, number=numiter) / numiter)
    print()

pd.show_versions()
Pandas1  10 0.000621199607849
Pandas2  10 0.000686831474304
Py dict  10 2.0170211792e-05

Pandas1  100000 0.00149286031723
Pandas2  100000 0.00118808984756
Py dict  100000 8.47816467285e-06

Pandas1  1000000 0.000708899497986
Pandas2  1000000 0.000479419231415
Py dict  1000000 1.64794921875e-05

Pandas1  10000000 0.000798969268799
Pandas2  10000000 0.000410139560699
Py dict  10000000 1.47914886475e-05

...though, slightly depressingly, the Python dict is still more than 10 times faster (roughly 30x to 175x in the timings above).
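If the dict speed is wanted together with the index-preserving behaviour asked for in the question, one possible approach is to do the lookups through the dict and wrap the result back into a Series. This is only a sketch: `lookup_with_dict` is a hypothetical helper, not a pandas function, and unlike `Series.map` it raises `KeyError` for values missing from the table instead of returning NaN.

```python
import numpy as np
import pandas as pd


def lookup_with_dict(query_vals, table):
    """Map each value of query_vals through a plain dict, keeping query_vals' index."""
    mapped = [table[k] for k in query_vals.tolist()]  # KeyError if a key is missing
    return pd.Series(mapped, index=query_vals.index)


if __name__ == "__main__":
    domain = np.arange(0, 1000)
    maptabledict = pd.Series(domain + 10, index=domain).to_dict()  # built once, reused
    query_vals = pd.Series([1, 2, 3], index=["a", "b", "c"])
    print(lookup_with_dict(query_vals, maptabledict))
```

Building the dict with `to_dict()` is itself O(maplen), so this only pays off when the same table is queried many times.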
