将一系列int转换为字符串 - 为什么应用比astype快得多?

Cha*_*les 26 python string performance python-internals pandas

我有一个pandas.Series包含整数,但我需要将这些转换为字符串以用于某些下游工具.所以假设我有一个Series对象:

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000000))
Run Code Online (Sandbox Code Playgroud)

在StackOverflow和其他网站上,我见过大多数人认为最好的方法是:

%% timeit
x = x.astype(str)
Run Code Online (Sandbox Code Playgroud)

这大约需要2秒钟.

使用时x = x.apply(str),只需0.2秒.

为什么x.astype(str)这么慢?应该推荐的方式x.apply(str)吗?

我主要对python 3的行为感兴趣.

MSe*_*ert 22

让我们从一些一般建议开始:如果您有兴趣找到Python代码的瓶颈,您可以使用分析器来查找大部分时间吃掉的功能/部件.在这种情况下,我使用行分析器,因为您实际上可以看到实现和每行所花费的时间.

但是,默认情况下,这些工具不适用于C或Cython.鉴于CPython(我正在使用的是Python解释器),NumPy和pandas大量使用C和Cython,我将在一定程度上限制分析.

Actually: one probably could extend profiling to the Cython code and probably also the C code by recompiling it with debug symbols and tracing, however it's not an easy task to compile these libraries so I won't do that (but if someone likes to do that the Cython documentation includes a page about profiling Cython code).

But let's see how far I can get:

Line-Profiling Python code

I'm going to use line-profiler and a Jupyter Notebook here:

%load_ext line_profiler

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 100000))
Run Code Online (Sandbox Code Playgroud)

Profiling x.astype

%lprun -f x.astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    87                                                   @wraps(func)
    88                                                   def wrapper(*args, **kwargs):
    89         1           12     12.0      0.0              old_arg_value = kwargs.pop(old_arg_name, None)
    90         1            5      5.0      0.0              if old_arg_value is not None:
    91                                                           if mapping is not None:
   ...
   118         1       663354 663354.0    100.0              return func(*args, **kwargs)
Run Code Online (Sandbox Code Playgroud)

So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:

%lprun -f x.astype.__wrapped__ x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3896                                               @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
  3897                                                                mapping={True: 'raise', False: 'ignore'})
  3898                                               def astype(self, dtype, copy=True, errors='raise', **kwargs):
  3899                                                   """
  ...
  3975                                                   """
  3976         1           28     28.0      0.0          if is_dict_like(dtype):
  3977                                                       if self.ndim == 1:  # i.e. Series
  ...
  4001                                           
  4002                                                   # else, only a single dtype is given
  4003         1           14     14.0      0.0          new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
  4004         1       685863 685863.0     99.9                                       **kwargs)
  4005         1          340    340.0      0.0          return self._constructor(new_data).__finalize__(self)
Run Code Online (Sandbox Code Playgroud)

Source

Again one line is the bottleneck so let's check the _data.astype method:

%lprun -f x._data.astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3461                                               def astype(self, dtype, **kwargs):
  3462         1       695866 695866.0    100.0          return self.apply('astype', dtype=dtype, **kwargs)
Run Code Online (Sandbox Code Playgroud)

Okay, another delegate, let's see what _data.apply does:

%lprun -f x._data.apply x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3251                                               def apply(self, f, axes=None, filter=None, do_integrity_check=False,
  3252                                                         consolidate=True, **kwargs):
  3253                                                   """
  ...
  3271                                                   """
  3272                                           
  3273         1           12     12.0      0.0          result_blocks = []
  ...
  3309                                           
  3310         1           10     10.0      0.0          aligned_args = dict((k, kwargs[k])
  3311         1           29     29.0      0.0                              for k in align_keys
  3312                                                                       if hasattr(kwargs[k], 'reindex_axis'))
  3313                                           
  3314         2           28     14.0      0.0          for b in self.blocks:
  ...
  3329         1       674974 674974.0    100.0              applied = getattr(b, f)(**kwargs)
  3330         1           30     30.0      0.0              result_blocks = _extend_blocks(applied, result_blocks)
  3331                                           
  3332         1           10     10.0      0.0          if len(result_blocks) == 0:
  3333                                                       return self.make_empty(axes or self.axes)
  3334         1           10     10.0      0.0          bm = self.__class__(result_blocks, axes or self.axes,
  3335         1           76     76.0      0.0                              do_integrity_check=do_integrity_check)
  3336         1           13     13.0      0.0          bm._consolidate_inplace()
  3337         1            7      7.0      0.0          return bm
Run Code Online (Sandbox Code Playgroud)

Source

而且......一个函数调用一直在进行,这次是x._data.blocks[0].astype:

%lprun -f x._data.blocks[0].astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   542                                               def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
   543         1           18     18.0      0.0          return self._astype(dtype, copy=copy, errors=errors, values=values,
   544         1       671092 671092.0    100.0                              **kwargs)
Run Code Online (Sandbox Code Playgroud)

..这是另一位代表.​​.....

%lprun -f x._data.blocks[0]._astype x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   546                                               def _astype(self, dtype, copy=False, errors='raise', values=None,
   547                                                           klass=None, mgr=None, **kwargs):
   548                                                   """
   ...
   557                                                   """
   558         1           11     11.0      0.0          errors_legal_values = ('raise', 'ignore')
   559                                           
   560         1            8      8.0      0.0          if errors not in errors_legal_values:
   561                                                       invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
   562                                                                      "Supplied value is '{}'".format(
   563                                                                          list(errors_legal_values), errors))
   564                                                       raise ValueError(invalid_arg)
   565                                           
   566         1           23     23.0      0.0          if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
   567                                                       msg = ("Expected an instance of {}, but got the class instead. "
   568                                                              "Try instantiating 'dtype'.".format(dtype.__name__))
   569                                                       raise TypeError(msg)
   570                                           
   571                                                   # may need to convert to categorical
   572                                                   # this is only called for non-categoricals
   573         1           72     72.0      0.0          if self.is_categorical_astype(dtype):
   ...
   595                                           
   596                                                   # astype processing
   597         1           16     16.0      0.0          dtype = np.dtype(dtype)
   598         1           19     19.0      0.0          if self.dtype == dtype:
   ...
   603         1            8      8.0      0.0          if klass is None:
   604         1           13     13.0      0.0              if dtype == np.object_:
   605                                                           klass = ObjectBlock
   606         1            6      6.0      0.0          try:
   607                                                       # force the copy here
   608         1            7      7.0      0.0              if values is None:
   609                                           
   610         1            8      8.0      0.0                  if issubclass(dtype.type,
   611         1           14     14.0      0.0                                (compat.text_type, compat.string_types)):
   612                                           
   613                                                               # use native type formatting for datetime/tz/timedelta
   614         1           15     15.0      0.0                      if self.is_datelike:
   615                                                                   values = self.to_native_types()
   616                                           
   617                                                               # astype formatting
   618                                                               else:
   619         1            8      8.0      0.0                          values = self.values
   620                                           
   621                                                           else:
   622                                                               values = self.get_values(dtype=dtype)
   623                                           
   624                                                           # _astype_nansafe works fine with 1-d only
   625         1       665777 665777.0     99.9                  values = astype_nansafe(values.ravel(), dtype, copy=True)
   626         1           32     32.0      0.0                  values = values.reshape(self.shape)
   627                                           
   628         1           17     17.0      0.0              newb = make_block(values, placement=self.mgr_locs, dtype=dtype,
   629         1          269    269.0      0.0                                klass=klass)
   630                                                   except:
   631                                                       if errors == 'raise':
   632                                                           raise
   633                                                       newb = self.copy() if copy else self
   634                                           
   635         1            8      8.0      0.0          if newb.is_numeric and self.is_numeric:
   ...
   642         1            6      6.0      0.0          return newb
Run Code Online (Sandbox Code Playgroud)

资源

......好吧,还没有.我们来看看astype_nansafe:

%lprun -f pd.core.internals.astype_nansafe x.astype(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   640                                           def astype_nansafe(arr, dtype, copy=True):
   641                                               """ return a view if copy is False, but
   642                                                   need to be very careful as the result shape could change! """
   643         1           13     13.0      0.0      if not isinstance(dtype, np.dtype):
   644                                                   dtype = pandas_dtype(dtype)
   645                                           
   646         1            8      8.0      0.0      if issubclass(dtype.type, text_type):
   647                                                   # in Py3 that's str, in Py2 that's unicode
   648         1       663317 663317.0    100.0          return lib.astype_unicode(arr.ravel()).reshape(arr.shape)
   ...
Run Code Online (Sandbox Code Playgroud)

资源

再一次,它是100行的一行,所以我将进一步发挥一个功能:

%lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)

UserWarning: Could not extract a code object for the object <built-in function astype_unicode>
Run Code Online (Sandbox Code Playgroud)

好的,我们发现了一个built-in function,这意味着它是一个C函数.在这种情况下,它是一个Cython函数.但这意味着我们无法通过line-profiler深入挖掘.所以我现在就到此为止.

剖析 x.apply

%lprun -f x.apply x.apply(str)
Run Code Online (Sandbox Code Playgroud)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2426                                               def apply(self, func, convert_dtype=True, args=(), **kwds):
  2427                                                   """
  ...
  2523                                                   """
  2524         1           84     84.0      0.0          if len(self) == 0:
  2525                                                       return self._constructor(dtype=self.dtype,
  2526                                                                                index=self.index).__finalize__(self)
  2527                                           
  2528                                                   # dispatch to agg
  2529         1           11     11.0      0.0          if isinstance(func, (list, dict)):
  2530                                                       return self.aggregate(func, *args, **kwds)
  2531                                           
  2532                                                   # if we are a string, try to dispatch
  2533         1           12     12.0      0.0          if isinstance(func, compat.string_types):
  2534                                                       return self._try_aggregate_string_function(func, *args, **kwds)
  2535                                           
  2536                                                   # handle ufuncs and lambdas
  2537         1            7      7.0      0.0          if kwds or args and not isinstance(func, np.ufunc):
  2538                                                       f = lambda x: func(x, *args, **kwds)
  2539                                                   else:
  2540         1            6      6.0      0.0              f = func
  2541                                           
  2542         1          154    154.0      0.1          with np.errstate(all='ignore'):
  2543         1           11     11.0      0.0              if isinstance(f, np.ufunc):
  2544                                                           return f(self)
  2545                                           
  2546                                                       # row-wise access
  2547         1          188    188.0      0.1              if is_extension_type(self.dtype):
  2548                                                           mapped = self._values.map(f)
  2549                                                       else:
  2550         1         6238   6238.0      3.3                  values = self.asobject
  2551         1       181910 181910.0     95.5                  mapped = lib.map_infer(values, f, convert=convert_dtype)
  2552                                           
  2553         1           28     28.0      0.0          if len(mapped) and isinstance(mapped[0], Series):
  2554                                                       from pandas.core.frame import DataFrame
  2555                                                       return DataFrame(mapped.tolist(), index=self.index)
  2556                                                   else:
  2557         1           19     19.0      0.0              return self._constructor(mapped,
  2558         1         1870   1870.0      1.0                                       index=self.index).__finalize__(self)
Run Code Online (Sandbox Code Playgroud)

资源

同样,这是一个大部分时间都需要的功能:lib.map_infer......

%lprun -f pd.core.series.lib.map_infer x.apply(str)
Run Code Online (Sandbox Code Playgroud)
Could not extract a code object for the object <built-in function map_infer>
Run Code Online (Sandbox Code Playgroud)

好的,这是另一个Cython功能.

这时候还有另外一个(虽然不太显著)与贡献者〜3%: values = self.asobject.但我暂时忽略了这一点,因为我们对主要贡献者感兴趣.

进入C/Cython

The functions called by astype

This is the astype_unicode function:

cpdef ndarray[object] astype_unicode(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, unicode(arr[i]))

    return result
Run Code Online (Sandbox Code Playgroud)

Source

This function uses this helper:

cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):
    cdef:
        Py_ssize_t i, sz
    if is_float_object(loc):
        casted = int(loc)
        if casted == loc:
            loc = casted
    i = <Py_ssize_t> loc
    sz = cnp.PyArray_SIZE(arr)

    if i < 0:
        i += sz
    elif i >= sz:
        raise IndexError('index out of bounds')

    assign_value_1d(arr, i, value)
Run Code Online (Sandbox Code Playgroud)

Source

Which itself uses this C function:

PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,
                                  PyObject* v) {
    npy_intp i = (npy_intp)_i;
    char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
    return PyArray_DESCR(ap)->f->setitem(v, item, ap);
}
Run Code Online (Sandbox Code Playgroud)

Source

Functions called by apply

This is the implementation of the map_infer function:

def map_infer(ndarray arr, object f, bint convert=1):
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(util.get_value_at(arr, i))

        # unbox 0-dim arrays, GH #690
        if is_array(val) and PyArray_NDIM(val) == 0:
            # is there a faster way to unbox?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result
Run Code Online (Sandbox Code Playgroud)

Source

With this helper:

cdef inline object get_value_at(ndarray arr, object loc):
    cdef:
        Py_ssize_t i, sz
        int casted

    if is_float_object(loc):
        casted = int(loc)
        if casted == loc:
            loc = casted
    i = <Py_ssize_t> loc
    sz = cnp.PyArray_SIZE(arr)

    if i < 0 and sz > 0:
        i += sz
    elif i >= sz or sz == 0:
        raise IndexError('index out of bounds')

    return get_value_1d(arr, i)
Run Code Online (Sandbox Code Playgroud)

Source

Which uses this C function:

PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
    char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
    return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap);
}
Run Code Online (Sandbox Code Playgroud)

Source

Some thoughts on the Cython code

There are some differences between the Cython codes that are called eventually.

The one taken by astype uses unicode while the apply path uses the function passed in. Let's see if that makes a difference (again IPython/Jupyter makes it very easy to compile Cython code yourself):

%load_ext cython

%%cython

import numpy as np
cimport numpy as np

cpdef object func_called_by_astype(np.ndarray arr):
    cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
    for i in range(arr.size):
        ret[i] = unicode(arr[i])
    return ret

cpdef object func_called_by_apply(np.ndarray arr, object f):
    cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
    for i in range(arr.size):
        ret[i] = f(arr[i])
    return ret
Run Code Online (Sandbox Code Playgroud)

Timing:

import numpy as np

arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr, str)
632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)

Okay, there is a difference but it's wrong, it would actually indicate that apply would be slightly slower.

But remember the asobject call that I mentioned earlier in the apply function? Could that be the reason? Let's see:

import numpy as np

arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)

Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str is a Python function and these are generally much faster if you already have Python objects and NumPy (or Pandas) don't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array is of dtype object).

However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:

val = f(util.get_value_at(arr, i))
if is_array(val) and PyArray_NDIM(val) == 0:
    val = val.item()
result[i] = val
Run Code Online (Sandbox Code Playgroud)

part of the map_infer function is faster than:

for i in range(n):
    # we can use the unsafe version because we know `result` is mutable
    # since it was created from `np.empty`
    util.set_value_at_unsafe(result, i, unicode(arr[i]))
Run Code Online (Sandbox Code Playgroud)

which is called by the astype(str) path. The comments of the first function seem to indicate that the writer of map_infer actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?" while the other one maybe was written without special care about performance. But that's just a guess.

Also on my computer I'm actually quite close to the performance of the x.astype(str) and x.apply(str) already:

import numpy as np

arr = np.random.randint(0, 100, 1000000)
s = pd.Series(arr)
%timeit s.astype(str)
535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_astype(arr)
547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit s.apply(str)
216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)

Note that I also checked some other variants that return a different result:

%timeit s.values.astype(str)  # array of strings
407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(str, s.values.tolist()))  # list of strings
184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Run Code Online (Sandbox Code Playgroud)

Interestingly the Python loop with list and map seems to be the fastest on my computer.

I actually made a small benchmark including plot:

import pandas as pd
import simple_benchmark

def Series_astype(series):
    return series.astype(str)

def Series_apply(series):
    return series.apply(str)

def Series_tolist_map(series):
    return list(map(str, series.values.tolist()))

def Series_values_astype(series):
    return series.values.astype(str)


arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)}
b = simple_benchmark.benchmark(
    [Series_astype, Series_apply, Series_tolist_map, Series_values_astype],
    arguments,
    argument_name='Series size'
)

%matplotlib notebook
b.plot()
Run Code Online (Sandbox Code Playgroud)

enter image description here

Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.

The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:

Versions
--------
Python 3.6.5
NumPy 1.14.2
Pandas 0.22.0
Run Code Online (Sandbox Code Playgroud)


jpp*_*jpp 14

性能

值得一开始以来的任何调查之前看实际表现,违背民意,list(map(str, x))似乎是x.apply(str).

import pandas as pd, numpy as np

### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###

x = pd.Series(np.random.randint(0, 100, 100000))

%timeit x.apply(str)          # 42ms   (1)
%timeit x.map(str)            # 42ms   (2)
%timeit x.astype(str)         # 559ms  (3)
%timeit [str(i) for i in x]   # 566ms  (4)
%timeit list(map(str, x))     # 536ms  (5)
%timeit x.values.astype(str)  # 25ms   (6)
Run Code Online (Sandbox Code Playgroud)

值得注意的要点:

  1. (5)比(3)/(4)略快,我们期望随着更多的工作进入C [假设没有使用lambda功能].
  2. (6)是迄今为止最快的.
  3. (1)/(2)是相似的.
  4. (3)/(4)是相似的.

为什么x.map/x.apply快?

似乎是因为它使用快速编译的Cython代码:

cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
Run Code Online (Sandbox Code Playgroud)

为什么x.astype(str)慢?

熊猫适用str于系列中的每个项目,而不是使用上面的Cython.

因此,性能与[str(i) for i in x]/ 相当list(map(str, x)).

为什么x.values.astype(str)这么快?

Numpy不对数组的每个元素应用函数.我找到的一个描述:

如果你做了s.values.astype(str)什么你得到的是一个对象持有 int.这是numpy在做转换,而pandas迭代每个项目并调用str(item)它.所以,如果你s.astype(str)有一个对象持有str.

在没有空的情况下,为什么没有实现numpy版本有一个技术原因.

  • 您可能想要表示您正在进行基准测试的NumPy,Pandas,Python的版本以及您的计算机规格.否则这不是很有意义.例如,w/NumPy 1.14.1,Pandas 0.22.0,Python 3.6.4,也用'%timeit`,`x.apply(str)`进行基准测试需要18ms和`list(map(str,x))`我的样本数据需要15​​ms.基准订单完全不同. (2认同)