Cha*_*les 26 python string performance python-internals pandas
我有一个pandas.Series包含整数,但我需要将这些转换为字符串以用于某些下游工具.所以假设我有一个Series对象:
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 1000000))
Run Code Online (Sandbox Code Playgroud)
在StackOverflow和其他网站上,我见过大多数人认为最好的方法是:
%% timeit
x = x.astype(str)
Run Code Online (Sandbox Code Playgroud)
这大约需要2秒钟.
使用时x = x.apply(str),只需0.2秒.
为什么x.astype(str)这么慢?应该推荐的方式x.apply(str)吗?
我主要对python 3的行为感兴趣.
MSe*_*ert 22
让我们从一些一般建议开始:如果您有兴趣找到Python代码的瓶颈,您可以使用分析器来查找大部分时间吃掉的功能/部件.在这种情况下,我使用行分析器,因为您实际上可以看到实现和每行所花费的时间.
但是,默认情况下,这些工具不适用于C或Cython.鉴于CPython(我正在使用的是Python解释器),NumPy和pandas大量使用C和Cython,我将在一定程度上限制分析.
Actually: one probably could extend profiling to the Cython code and probably also the C code by recompiling it with debug symbols and tracing, however it's not an easy task to compile these libraries so I won't do that (but if someone likes to do that the Cython documentation includes a page about profiling Cython code).
But let's see how far I can get:
I'm going to use line-profiler and a Jupyter Notebook here:
%load_ext line_profiler
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 100000))
Run Code Online (Sandbox Code Playgroud)
x.astype%lprun -f x.astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
87 @wraps(func)
88 def wrapper(*args, **kwargs):
89 1 12 12.0 0.0 old_arg_value = kwargs.pop(old_arg_name, None)
90 1 5 5.0 0.0 if old_arg_value is not None:
91 if mapping is not None:
...
118 1 663354 663354.0 100.0 return func(*args, **kwargs)
Run Code Online (Sandbox Code Playgroud)
So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:
%lprun -f x.astype.__wrapped__ x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3896 @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
3897 mapping={True: 'raise', False: 'ignore'})
3898 def astype(self, dtype, copy=True, errors='raise', **kwargs):
3899 """
...
3975 """
3976 1 28 28.0 0.0 if is_dict_like(dtype):
3977 if self.ndim == 1: # i.e. Series
...
4001
4002 # else, only a single dtype is given
4003 1 14 14.0 0.0 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
4004 1 685863 685863.0 99.9 **kwargs)
4005 1 340 340.0 0.0 return self._constructor(new_data).__finalize__(self)
Run Code Online (Sandbox Code Playgroud)
Again one line is the bottleneck so let's check the _data.astype method:
%lprun -f x._data.astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3461 def astype(self, dtype, **kwargs):
3462 1 695866 695866.0 100.0 return self.apply('astype', dtype=dtype, **kwargs)
Run Code Online (Sandbox Code Playgroud)
Okay, another delegate, let's see what _data.apply does:
%lprun -f x._data.apply x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3251 def apply(self, f, axes=None, filter=None, do_integrity_check=False,
3252 consolidate=True, **kwargs):
3253 """
...
3271 """
3272
3273 1 12 12.0 0.0 result_blocks = []
...
3309
3310 1 10 10.0 0.0 aligned_args = dict((k, kwargs[k])
3311 1 29 29.0 0.0 for k in align_keys
3312 if hasattr(kwargs[k], 'reindex_axis'))
3313
3314 2 28 14.0 0.0 for b in self.blocks:
...
3329 1 674974 674974.0 100.0 applied = getattr(b, f)(**kwargs)
3330 1 30 30.0 0.0 result_blocks = _extend_blocks(applied, result_blocks)
3331
3332 1 10 10.0 0.0 if len(result_blocks) == 0:
3333 return self.make_empty(axes or self.axes)
3334 1 10 10.0 0.0 bm = self.__class__(result_blocks, axes or self.axes,
3335 1 76 76.0 0.0 do_integrity_check=do_integrity_check)
3336 1 13 13.0 0.0 bm._consolidate_inplace()
3337 1 7 7.0 0.0 return bm
Run Code Online (Sandbox Code Playgroud)
而且......一个函数调用一直在进行,这次是x._data.blocks[0].astype:
%lprun -f x._data.blocks[0].astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
543 1 18 18.0 0.0 return self._astype(dtype, copy=copy, errors=errors, values=values,
544 1 671092 671092.0 100.0 **kwargs)
Run Code Online (Sandbox Code Playgroud)
..这是另一位代表......
%lprun -f x._data.blocks[0]._astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
546 def _astype(self, dtype, copy=False, errors='raise', values=None,
547 klass=None, mgr=None, **kwargs):
548 """
...
557 """
558 1 11 11.0 0.0 errors_legal_values = ('raise', 'ignore')
559
560 1 8 8.0 0.0 if errors not in errors_legal_values:
561 invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
562 "Supplied value is '{}'".format(
563 list(errors_legal_values), errors))
564 raise ValueError(invalid_arg)
565
566 1 23 23.0 0.0 if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
567 msg = ("Expected an instance of {}, but got the class instead. "
568 "Try instantiating 'dtype'.".format(dtype.__name__))
569 raise TypeError(msg)
570
571 # may need to convert to categorical
572 # this is only called for non-categoricals
573 1 72 72.0 0.0 if self.is_categorical_astype(dtype):
...
595
596 # astype processing
597 1 16 16.0 0.0 dtype = np.dtype(dtype)
598 1 19 19.0 0.0 if self.dtype == dtype:
...
603 1 8 8.0 0.0 if klass is None:
604 1 13 13.0 0.0 if dtype == np.object_:
605 klass = ObjectBlock
606 1 6 6.0 0.0 try:
607 # force the copy here
608 1 7 7.0 0.0 if values is None:
609
610 1 8 8.0 0.0 if issubclass(dtype.type,
611 1 14 14.0 0.0 (compat.text_type, compat.string_types)):
612
613 # use native type formatting for datetime/tz/timedelta
614 1 15 15.0 0.0 if self.is_datelike:
615 values = self.to_native_types()
616
617 # astype formatting
618 else:
619 1 8 8.0 0.0 values = self.values
620
621 else:
622 values = self.get_values(dtype=dtype)
623
624 # _astype_nansafe works fine with 1-d only
625 1 665777 665777.0 99.9 values = astype_nansafe(values.ravel(), dtype, copy=True)
626 1 32 32.0 0.0 values = values.reshape(self.shape)
627
628 1 17 17.0 0.0 newb = make_block(values, placement=self.mgr_locs, dtype=dtype,
629 1 269 269.0 0.0 klass=klass)
630 except:
631 if errors == 'raise':
632 raise
633 newb = self.copy() if copy else self
634
635 1 8 8.0 0.0 if newb.is_numeric and self.is_numeric:
...
642 1 6 6.0 0.0 return newb
Run Code Online (Sandbox Code Playgroud)
......好吧,还没有.我们来看看astype_nansafe:
%lprun -f pd.core.internals.astype_nansafe x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
640 def astype_nansafe(arr, dtype, copy=True):
641 """ return a view if copy is False, but
642 need to be very careful as the result shape could change! """
643 1 13 13.0 0.0 if not isinstance(dtype, np.dtype):
644 dtype = pandas_dtype(dtype)
645
646 1 8 8.0 0.0 if issubclass(dtype.type, text_type):
647 # in Py3 that's str, in Py2 that's unicode
648 1 663317 663317.0 100.0 return lib.astype_unicode(arr.ravel()).reshape(arr.shape)
...
Run Code Online (Sandbox Code Playgroud)
再一次,它是100行的一行,所以我将进一步发挥一个功能:
%lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)
UserWarning: Could not extract a code object for the object <built-in function astype_unicode>
Run Code Online (Sandbox Code Playgroud)
好的,我们发现了一个built-in function,这意味着它是一个C函数.在这种情况下,它是一个Cython函数.但这意味着我们无法通过line-profiler深入挖掘.所以我现在就到此为止.
x.apply%lprun -f x.apply x.apply(str)
Run Code Online (Sandbox Code Playgroud)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2426 def apply(self, func, convert_dtype=True, args=(), **kwds):
2427 """
...
2523 """
2524 1 84 84.0 0.0 if len(self) == 0:
2525 return self._constructor(dtype=self.dtype,
2526 index=self.index).__finalize__(self)
2527
2528 # dispatch to agg
2529 1 11 11.0 0.0 if isinstance(func, (list, dict)):
2530 return self.aggregate(func, *args, **kwds)
2531
2532 # if we are a string, try to dispatch
2533 1 12 12.0 0.0 if isinstance(func, compat.string_types):
2534 return self._try_aggregate_string_function(func, *args, **kwds)
2535
2536 # handle ufuncs and lambdas
2537 1 7 7.0 0.0 if kwds or args and not isinstance(func, np.ufunc):
2538 f = lambda x: func(x, *args, **kwds)
2539 else:
2540 1 6 6.0 0.0 f = func
2541
2542 1 154 154.0 0.1 with np.errstate(all='ignore'):
2543 1 11 11.0 0.0 if isinstance(f, np.ufunc):
2544 return f(self)
2545
2546 # row-wise access
2547 1 188 188.0 0.1 if is_extension_type(self.dtype):
2548 mapped = self._values.map(f)
2549 else:
2550 1 6238 6238.0 3.3 values = self.asobject
2551 1 181910 181910.0 95.5 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 1 28 28.0 0.0 if len(mapped) and isinstance(mapped[0], Series):
2554 from pandas.core.frame import DataFrame
2555 return DataFrame(mapped.tolist(), index=self.index)
2556 else:
2557 1 19 19.0 0.0 return self._constructor(mapped,
2558 1 1870 1870.0 1.0 index=self.index).__finalize__(self)
Run Code Online (Sandbox Code Playgroud)
同样,这是一个大部分时间都需要的功能:lib.map_infer......
%lprun -f pd.core.series.lib.map_infer x.apply(str)
Run Code Online (Sandbox Code Playgroud)
Could not extract a code object for the object <built-in function map_infer>
Run Code Online (Sandbox Code Playgroud)
好的,这是另一个Cython功能.
这时候还有另外一个(虽然不太显著)与贡献者〜3%: values = self.asobject.但我暂时忽略了这一点,因为我们对主要贡献者感兴趣.
astypeThis is the astype_unicode function:
cpdef ndarray[object] astype_unicode(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
return result
Run Code Online (Sandbox Code Playgroud)
This function uses this helper:
cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):
cdef:
Py_ssize_t i, sz
if is_float_object(loc):
casted = int(loc)
if casted == loc:
loc = casted
i = <Py_ssize_t> loc
sz = cnp.PyArray_SIZE(arr)
if i < 0:
i += sz
elif i >= sz:
raise IndexError('index out of bounds')
assign_value_1d(arr, i, value)
Run Code Online (Sandbox Code Playgroud)
Which itself uses this C function:
PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,
PyObject* v) {
npy_intp i = (npy_intp)_i;
char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
return PyArray_DESCR(ap)->f->setitem(v, item, ap);
}
Run Code Online (Sandbox Code Playgroud)
applyThis is the implementation of the map_infer function:
def map_infer(ndarray arr, object f, bint convert=1):
cdef:
Py_ssize_t i, n
ndarray[object] result
object val
n = len(arr)
result = np.empty(n, dtype=object)
for i in range(n):
val = f(util.get_value_at(arr, i))
# unbox 0-dim arrays, GH #690
if is_array(val) and PyArray_NDIM(val) == 0:
# is there a faster way to unbox?
val = val.item()
result[i] = val
if convert:
return maybe_convert_objects(result,
try_float=0,
convert_datetime=0,
convert_timedelta=0)
return result
Run Code Online (Sandbox Code Playgroud)
With this helper:
cdef inline object get_value_at(ndarray arr, object loc):
cdef:
Py_ssize_t i, sz
int casted
if is_float_object(loc):
casted = int(loc)
if casted == loc:
loc = casted
i = <Py_ssize_t> loc
sz = cnp.PyArray_SIZE(arr)
if i < 0 and sz > 0:
i += sz
elif i >= sz or sz == 0:
raise IndexError('index out of bounds')
return get_value_1d(arr, i)
Run Code Online (Sandbox Code Playgroud)
Which uses this C function:
PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap);
}
Run Code Online (Sandbox Code Playgroud)
There are some differences between the Cython codes that are called eventually.
The one taken by astype uses unicode while the apply path uses the function passed in. Let's see if that makes a difference (again IPython/Jupyter makes it very easy to compile Cython code yourself):
%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef object func_called_by_astype(np.ndarray arr):
cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
for i in range(arr.size):
ret[i] = unicode(arr[i])
return ret
cpdef object func_called_by_apply(np.ndarray arr, object f):
cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
for i in range(arr.size):
ret[i] = f(arr[i])
return ret
Run Code Online (Sandbox Code Playgroud)
Timing:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr, str)
632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
Okay, there is a difference but it's wrong, it would actually indicate that apply would be slightly slower.
But remember the asobject call that I mentioned earlier in the apply function? Could that be the reason? Let's see:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str is a Python function and these are generally much faster if you already have Python objects and NumPy (or Pandas) don't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array is of dtype object).
However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:
val = f(util.get_value_at(arr, i))
if is_array(val) and PyArray_NDIM(val) == 0:
val = val.item()
result[i] = val
Run Code Online (Sandbox Code Playgroud)
part of the map_infer function is faster than:
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
Run Code Online (Sandbox Code Playgroud)
which is called by the astype(str) path. The comments of the first function seem to indicate that the writer of map_infer actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?" while the other one maybe was written without special care about performance. But that's just a guess.
Also on my computer I'm actually quite close to the performance of the x.astype(str) and x.apply(str) already:
import numpy as np
arr = np.random.randint(0, 100, 1000000)
s = pd.Series(arr)
%timeit s.astype(str)
535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_astype(arr)
547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit s.apply(str)
216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
Note that I also checked some other variants that return a different result:
%timeit s.values.astype(str) # array of strings
407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(str, s.values.tolist())) # list of strings
184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Run Code Online (Sandbox Code Playgroud)
Interestingly the Python loop with list and map seems to be the fastest on my computer.
I actually made a small benchmark including plot:
import pandas as pd
import simple_benchmark
def Series_astype(series):
return series.astype(str)
def Series_apply(series):
return series.apply(str)
def Series_tolist_map(series):
return list(map(str, series.values.tolist()))
def Series_values_astype(series):
return series.values.astype(str)
arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)}
b = simple_benchmark.benchmark(
[Series_astype, Series_apply, Series_tolist_map, Series_values_astype],
arguments,
argument_name='Series size'
)
%matplotlib notebook
b.plot()
Run Code Online (Sandbox Code Playgroud)
Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.
The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:
Versions
--------
Python 3.6.5
NumPy 1.14.2
Pandas 0.22.0
Run Code Online (Sandbox Code Playgroud)
jpp*_*jpp 14
性能
值得一开始以来的任何调查之前看实际表现,违背民意,list(map(str, x))似乎是慢比x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str) # 42ms (1)
%timeit x.map(str) # 42ms (2)
%timeit x.astype(str) # 559ms (3)
%timeit [str(i) for i in x] # 566ms (4)
%timeit list(map(str, x)) # 536ms (5)
%timeit x.values.astype(str) # 25ms (6)
Run Code Online (Sandbox Code Playgroud)
值得注意的要点:
lambda功能].为什么x.map/x.apply快?
这似乎是因为它使用快速编译的Cython代码:
cpdef ndarray[object] astype_str(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, str(arr[i]))
return result
Run Code Online (Sandbox Code Playgroud)
为什么x.astype(str)慢?
熊猫适用str于系列中的每个项目,而不是使用上面的Cython.
因此,性能与[str(i) for i in x]/ 相当list(map(str, x)).
为什么x.values.astype(str)这么快?
Numpy不对数组的每个元素应用函数.我找到的一个描述:
如果你做了
s.values.astype(str)什么你得到的是一个对象持有int.这是numpy在做转换,而pandas迭代每个项目并调用str(item)它.所以,如果你s.astype(str)有一个对象持有str.
在没有空的情况下,为什么没有实现numpy版本有一个技术原因.
| 归档时间: |
|
| 查看次数: |
1490 次 |
| 最近记录: |