Efficiently creating a new column with incremental values

Question

Nan*_*ani 19 python performance numpy pandas

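I am creating a column of incremental values and then prepending a string to each value. This is very slow on large data. Please suggest a faster, more efficient way to do this.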

df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].astype(str)

Input

id  Field   Value
1     A       1
2     B       0     
3     D       1

Output

id  Field   Value   New_Column
1     A       1     str_1
2     B       0     str_2
3     D       1     str_3
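
For anyone who wants to run the snippets in this thread, here is a minimal sketch that rebuilds the sample input and produces the expected output with the straightforward pandas approach. The column dtypes and the `counter` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample input from the question (dtypes assumed)
df = pd.DataFrame({'id': [1, 2, 3],
                   'Field': ['A', 'B', 'D'],
                   'Value': [1, 0, 1]})

# 1-based counter aligned with the frame's index
counter = pd.Series(np.arange(len(df)) + 1, index=df.index)

# Prefix each counter value with 'str_'
df['New_Column'] = 'str_' + counter.astype(str)
print(df)
```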

Answer 1

piR*_*red 24

I'll add two more to the mix

NumPy

from numpy.core.defchararray import add

df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3
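
As a side note, the same element-wise concatenation is also reachable through the public `np.char.add` alias, which avoids importing from `numpy.core`. A small sketch on a bare array (no DataFrame involved), with `labels` as an illustrative name:

```python
import numpy as np

# Element-wise string concatenation: the scalar prefix is broadcast
# against the array of stringified integers.
numbers = np.arange(1, 4).astype(str)
labels = np.char.add('str_', numbers)
print(labels)
```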

`f-string` in a comprehension

Python 3.6+

df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

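Timing tests

Conclusions

Comprehensions win the day on performance relative to simplicity. Mind you, this was cᴏʟᴅsᴘᴇᴇᴅ's proposed method. I appreciate the upvotes (thank you), but let's give credit where it's due.

Cythonizing the comprehension didn't seem to help. Neither did f-strings.
Divakar's numexpr approach comes out on top on larger data.

Functions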

%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]

pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))

Test

from timeit import timeit

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)

for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)
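
If you just want a feel for the numbers without the full harness, here is a stripped-down sketch. It times only the construction of the labels (not `assign`), passes callables to `timeit` instead of statement strings, and covers just two of the approaches. The function names and the size `n` are made up for illustration.

```python
from timeit import timeit

import numpy as np

n = 10_000  # number of labels to build (arbitrary choice)

def with_comprehension():
    # Plain Python: format each integer in a list comprehension
    return ['str_%s' % i for i in range(1, n + 1)]

def with_char_add():
    # NumPy: stringify the range, then concatenate element-wise
    return np.char.add('str_', np.arange(1, n + 1).astype(str))

timings = {
    'comprehension': timeit(with_comprehension, number=20),
    'np.char.add': timeit(with_char_add, number=20),
}

# Report each timing relative to the fastest one
fastest = min(timings.values())
for name, seconds in timings.items():
    print(name, round(seconds / fastest, 2))
```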

Results

res.plot(loglog=True)

res.div(res.min(1), 0)

           pir1      pir2      cld1      cld2       jez1      div1      div2
10     1.243998  1.137877  1.006501  1.000000   1.798684  1.277133  1.427025
30     1.009771  1.144892  1.012283  1.000000   2.144972  1.210803  1.283230
100    1.090170  1.567300  1.039085  1.000000   3.134154  1.281968  1.356706
300    1.061804  2.260091  1.072633  1.000000   4.792343  1.051886  1.305122
1000   1.135483  3.401408  1.120250  1.033484   7.678876  1.077430  1.000000
3000   1.310274  5.179131  1.359795  1.362273  13.006764  1.317411  1.000000
10000  2.110001  7.861251  1.942805  1.696498  17.905551  1.974627  1.000000
30000  2.188024  8.236724  2.100529  1.872661  18.416222  1.875299  1.000000
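
In case the table is hard to read: dividing each row by its own minimum turns raw timings into multiples of the fastest method for that size, so 1.000000 marks the winner in each row. A tiny sketch of that normalization, using made-up numbers (these are not measurements from this thread):

```python
import pandas as pd

# Made-up timings in seconds, purely to illustrate the normalization
raw = pd.DataFrame(
    {'method_a': [0.2, 3.0], 'method_b': [0.1, 6.0]},
    index=[10, 1000],
)

# Divide every row by that row's minimum; the fastest method becomes 1.0
relative = raw.div(raw.min(axis=1), axis=0)
print(relative)
```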

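Proposed approach

After tinkering quite a bit with the string and numeric dtypes and leveraging the easy interoperability between them, here's what I ended up with to get zero-padded strings, since NumPy handles those well and allows vectorized operations that way: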

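def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    # ASCII codes of the prefix followed by W zeros ('0' is code 48)
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype=np.uint8), padv]
    a1 = np.repeat(a0[None],N,axis=0) # one row of codes per output string

    # Split each number into its W decimal digits and add them onto the '0' codes
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    # Reinterpret each row of bytes as one fixed-width byte string
    return a1.view('S'+str(a1.shape[1])).ravel()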

Bringing in numexpr for a faster broadcasting + modulus operation:

import numexpr as ne

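def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    # ASCII codes of the prefix followed by W zeros ('0' is code 48)
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype=np.uint8), padv]
    a1 = np.repeat(a0[None],N,axis=0) # one row of codes per output string

    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1) # place values, most significant first
    # numexpr evaluates the broadcasted division and modulus in one pass;
    # the cast to uint8 below drops any fractional part, leaving the digit
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()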

So, to use it as the new column:

df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)

Sample runs:

In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]: 
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
       'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'], 
      dtype='|S6')

In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]: 
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
       'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
       'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
       'str_121', 'str_122', 'str_123'], 
      dtype='|S7')

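Explanation

Basic idea and explanation with a step-by-step sample run

The basic idea is to create the ASCII-equivalent numeric array, which can then be viewed as, or converted by dtype conversion into, a string array. More specifically, we create numerals of type uint8. Thus, each string is represented by a 1D array of numerals. For a list of strings, that translates to a 2D array of numerals, with each row (a 1D array) representing a single string.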
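
To make that concrete, here is a small sketch of the view trick on its own: two hand-written rows of ASCII codes are reinterpreted as fixed-width byte strings. On Python 3 the `'S'` view yields bytes, so the last line also converts them to `str`; the variable names are illustrative only.

```python
import numpy as np

# Each row holds the ASCII codes of one 6-character string
codes = np.array([[115, 116, 114, 95, 49, 53],   # s t r _ 1 5
                  [115, 116, 114, 95, 49, 54]],  # s t r _ 1 6
                 dtype=np.uint8)

# Reinterpret each 6-byte row as a single 'S6' item (no data is copied)
as_bytes = codes.view('S6').ravel()
print(as_bytes)

# Convert the byte strings to unicode strings
as_text = as_bytes.astype(str)
print(as_text)
```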

1) Inputs:

In [22]: prefix_str='str_'
    ...: start=15
    ...: stop=24

2) Parameters:

In [23]: N = stop - start # count of numbers
    ...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

In [24]: N,W
Out[24]: (9, 2)
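
A quick look at how the width formula behaves for a few values of `stop`, as a plain-Python sketch (`numeral_width` is a name made up here). Note that when `stop` is an exact power of ten the formula reserves one digit more than the largest number needs, which simply shows up as an extra leading zero.

```python
import math

def numeral_width(stop):
    # Same formula as in the answer: ceil(log10(stop + 1))
    return int(math.ceil(math.log10(stop + 1)))

for stop in (14, 24, 100, 124):
    print(stop, numeral_width(stop))
```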

3) Create a 1D array of numerals representing the starting string:

In [25]: padv = np.full(W,48,dtype=np.uint8)
    ...: a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype='uint8'), padv]

In [27]: a0
Out[27]: array([115, 116, 114,  95,  48,  48], dtype=uint8)

4) Extend to cover the range of strings as a 2D array:

In [33]: a1 = np.repeat(a0[None],N,axis=0)
    ...: r = np.arange(start, stop)
    ...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    ...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)

In [34]: a1
Out[34]: 
array([[115, 116, 114,  95,  49,  53],
       [115, 116, 114,  95,  49,  54],
       [115, 116, 114,  95,  49,  55],
       [115, 116, 114,  95,  49,  56],
       [115, 116, 114,  95,  49,  57],
       [115, 116, 114,  95,  50,  48],
       [115, 116, 114,  95,  50,  49],
       [115, 116, 114,  95,  50,  50],
       [115, 116, 114,  95,  50,  51]], dtype=uint8)
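
The digit columns in that array come from the broadcasted floor-division and modulus. Isolated from the rest, the step looks roughly like this (a sketch with illustrative names, using the same `start=15`, `stop=24`, `W=2`):

```python
import numpy as np

r = np.arange(15, 24)   # the numbers 15..23
W = 2                   # digits per number

# Place values, most significant first: [10, 1]
place_values = 10 ** np.arange(W - 1, -1, -1)

# Broadcasting a column of numbers against a row of place values gives
# one row of decimal digits per number.
digits = (r[:, None] // place_values) % 10

# Adding 48 (the code of '0') turns digits into ASCII codes
ascii_codes = 48 + digits
print(digits[0], ascii_codes[0])
```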

5) Thus, each row holds the ASCII equivalent of one string from the desired output. Let's get it with the final step:

In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]: 
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
       'str_21', 'str_22', 'str_23'], 
      dtype='|S6')

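Timings

Here's a quick timing test against the list-comprehension version, which looked like the best performer judging by the timings in the other post: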

In [339]: N = 10000

In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop

In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop

In [342]: N = 100000

In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop

In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop

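Python 3 code

On Python 3, to get a string-dtype array, the intermediate int-dtype array has to be padded with a few more zeros. So the equivalents without and with numexpr for Python 3 ended up being something along these lines: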
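
The reason for the extra zeros is that NumPy's unicode (`'U'`) dtype stores every character in 4 bytes. A small sketch of that layout and of the padding trick, pinning the dtype to little-endian (`'<U6'`) so the byte order is explicit; the variable names are illustrative:

```python
import numpy as np

# Look at the raw bytes behind a unicode string array
text = np.array(['str_15'], dtype='<U6')
raw = text.view(np.uint8)
print(raw[:8])  # first two characters, 4 bytes each

# Going the other way: put each ASCII code in the first of 4 bytes
codes = np.array([115, 116, 114, 95, 49, 53], dtype=np.uint8)
padded = np.zeros((len(codes), 4), dtype=np.uint8)
padded[:, 0] = codes

# Reinterpret the 24 bytes as one 6-character unicode string
rebuilt = padded.ravel().view('<U6')
print(rebuilt)
```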

Method #1 (no numexpr):

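def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion

    # ASCII codes of the prefix followed by W zeros ('0' is code 48)
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype=np.uint8), padv]

    r = np.arange(start, stop)

    # Split each number into its W decimal digits and add them onto the '0' codes
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1 = a1.ravel() # flatten to one code per character

    # A 'U' string stores 4 bytes per character (little-endian layout assumed):
    # the ASCII code followed by three zero bytes
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))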

Method #2 (using numexpr):

import numexpr as ne

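def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion

    # ASCII codes of the prefix followed by W zeros ('0' is code 48)
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype=np.uint8), padv]

    r = np.arange(start, stop)

    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1) # place values, most significant first
    # The cast to uint8 below drops any fractional part, leaving the digit
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1 = a1.ravel() # flatten to one code per character

    # A 'U' string stores 4 bytes per character (little-endian layout assumed):
    # the ASCII code followed by three zero bytes
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))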

Timings:

In [8]: N = 100000

In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop

In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop
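
Before trusting timings like these, it is worth confirming that a vectorized builder produces the same labels as plain zero-padded formatting. The sketch below is not the code from this answer: `inc_pattern_bytes_to_str` is a hypothetical, self-contained variant that keeps the same digit arithmetic but finishes with an `'S'` view followed by `astype(str)` instead of the 4-byte padding.

```python
import numpy as np

def inc_pattern_bytes_to_str(prefix_str, start, stop):
    # Hypothetical helper for illustration, not the answer's function
    n = stop - start
    w = int(np.ceil(np.log10(stop + 1)))  # width of the numeral part
    prefix = np.frombuffer(prefix_str.encode('ascii'), dtype=np.uint8)

    # One row of ASCII codes per label: prefix codes, then digit codes
    codes = np.empty((n, len(prefix) + w), dtype=np.uint8)
    codes[:, :len(prefix)] = prefix
    r = np.arange(start, stop)
    digits = (r[:, None] // 10 ** np.arange(w - 1, -1, -1)) % 10
    codes[:, len(prefix):] = 48 + digits  # 48 is the code of '0'

    # View each row as one byte string, then convert bytes to str
    return codes.view('S%d' % codes.shape[1]).ravel().astype(str)

result = inc_pattern_bytes_to_str('str_', 1, 124)
expected = ['str_%03d' % i for i in range(1, 124)]
print(result[:3], result.tolist() == expected)
```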

This deserves a bounty. If it's not too much trouble, could you add comments so that lesser mortals can understand what's going on? ;) (3 upvotes)

Answer 4

jez*_*ael 4

One possible solution is to convert the values to strings with map:

df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].map(str)

Archived: 7 years, 10 months ago
Views: 981
Last recorded: 7 years ago
