Python Pandas: "Reduce" function for Series

Question

hli*_*117 22 python performance reduce vectorization pandas

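Is there an analogue of reduce for a pandas Series?

For example, the analogue of map is pd.Series.apply, but I can't find any analogue of reduce.

My use case is that I have a pandas Series of lists: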

>>> business["categories"].head()

0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object

I'd like to merge these lists into one using reduce, like so:

categories = reduce(lambda l1, l2: l1 + l2, categories)

But this takes a horribly long time, because concatenating two lists is O(n) in Python, so each step copies everything accumulated so far. I'm hoping pd.Series has a vectorized way to do this faster.

Answer 1

Mik*_*ler 19

With `itertools.chain()` on the values

This could be faster:

from itertools import chain
categories = list(chain.from_iterable(categories.values))

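As a small self-contained illustration of the call above, here is a sketch that uses a hypothetical three-row stand-in for business["categories"] (the real data set is not available here, so the rows are copied from the head() output in the question):

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-in for business["categories"]: a Series whose
# elements are ordinary Python lists.
categories = pd.Series([
    ['Doctors', 'Health & Medical'],
    ['Nightlife'],
    ['Active Life', 'Mini Golf', 'Golf'],
])

# chain.from_iterable walks the inner lists lazily, one after another;
# list() then collects every item into a single new list.
flat = list(chain.from_iterable(categories.values))
print(flat)
# ['Doctors', 'Health & Medical', 'Nightlife', 'Active Life', 'Mini Golf', 'Golf']
```

Only one result list is built, which is the main difference from the pairwise `l1 + l2` approach.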

Performance

from functools import reduce
from itertools import chain
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this data set, chain is about 68 times faster than reduce.

Vectorization?

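Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might gain from vectorization. I made one attempt to vectorize the algorithm in another answer.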
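
To make the point about native data types concrete, the following sketch compares the dtype of a Series of lists with that of a numeric Series. The `numbers` Series is a hypothetical example added for contrast and is not part of the original question:

```python
import numpy as np
import pandas as pd

# A Series of lists stores references to Python objects.
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']])
print(categories.dtype)            # object
print(type(categories.values[0]))  # <class 'list'>

# Hypothetical numeric Series for contrast: a native NumPy dtype,
# so reductions such as sum() run in compiled code.
numbers = pd.Series([1.0, 2.0, 3.0])
print(numbers.dtype)  # float64
print(numbers.sum())  # 6.0
```

With the object dtype, pandas has to hand each element back to the Python interpreter, so there is no compiled loop to benefit from.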

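reduce builds many intermediate lists, all of which need memory allocation. Allocating memory is slow. Using `chain` significantly reduces the number of memory allocations. (2 upvotes)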
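
The effect of the intermediate lists can be estimated with a short sketch. The helper `copied_elements` is a hypothetical name introduced here for illustration; it counts how many list slots the pairwise `+` approach writes in total. The `operator.iadd` variant at the end is an alternative not mentioned in the thread, shown only to illustrate that accumulating in place avoids the intermediate lists:

```python
import operator
from functools import reduce
from itertools import chain

lists = [['a', 'b'], ['c', 'd', 'e']] * 1000

def copied_elements(lists):
    """Count the list slots written when joining pairwise with `+`."""
    total = 0
    running = 0
    for i, lst in enumerate(lists):
        running += len(lst)
        if i > 0:
            # Each `+` allocates a new list holding everything so far.
            total += running
    return total

pairwise_writes = copied_elements(lists)        # grows quadratically
single_pass_writes = sum(len(lst) for lst in lists)  # one slot per item
print(pairwise_writes, single_pass_writes)

# All three approaches produce the same flat list.
flat_reduce = reduce(lambda l1, l2: l1 + l2, lists)
flat_chain = list(chain.from_iterable(lists))
# iadd extends the fresh [] in place, so the input lists are not mutated.
flat_inplace = reduce(operator.iadd, lists, [])
```

The gap between the two counts widens as the number of inner lists grows, which is consistent with the timings reported in the answer.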

Answer 2

Mik*_*ler 7

Vectorized but slow

You can use NumPy's concatenate:

import numpy as np

list(np.concatenate(categories.values))

Performance

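But we already have lists, i.e. Python objects. So the vectorized version has to convert back and forth between Python objects and NumPy data types. This makes things slow: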

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop

%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

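The `%timeit` lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain Python script with the standard `timeit` module. The `candidates` and `results` names are introduced here for illustration, and the absolute numbers will differ from those quoted above depending on machine and library versions:

```python
import timeit
from functools import reduce
from itertools import chain

import numpy as np
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# The three approaches discussed in this thread, wrapped as callables.
candidates = {
    "chain.from_iterable": lambda: list(chain.from_iterable(categories.values)),
    "np.concatenate": lambda: list(np.concatenate(categories.values)),
    "reduce": lambda: reduce(lambda l1, l2: l1 + l2, categories),
}

results = {}
for name, func in candidates.items():
    # Average over a few runs to smooth out noise.
    results[name] = timeit.timeit(func, number=5) / 5
    print(f"{name:>20}: {results[name] * 1e3:.3f} ms per call")
```

Note that the np.concatenate variant returns NumPy string scalars rather than plain str objects, although they compare equal to the corresponding Python strings.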

Archived: 9 years, 7 months ago
Views: 11560
Last recorded: 6 years, 4 months ago
