Nan*_*ani 19 python performance numpy pandas
I am creating a column of incremental values and then prepending a string to each value. This is very slow on large data. Please suggest a faster, more efficient way to do it.
df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].astype(str)
Input:
id Field Value
1 A 1
2 B 0
3 D 1
Expected output:
id Field Value New_Column
1 A 1 str_1
2 B 0 str_2
3 D 1 str_3
piR*_*red 24
I'll add two more to the mix:
from numpy.core.defchararray import add
df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))
id Field Value new
0 1 A 1 str_1
1 2 B 0 str_2
2 3 D 1 str_3
An f-string in a comprehension (Python 3.6+):
df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])
id Field Value new
0 1 A 1 str_1
1 2 B 0 str_2
2 3 D 1 str_3
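The comprehension wins the day on performance relative to simplicity. Mind you, this is the method cᴏʟᴅsᴘᴇᴇᴅ proposed. I appreciate the upvotes (thank you), but let's give credit where it's due.
Cythonizing the comprehension didn't seem to help. Neither did f-strings.
Divakar's numexpr version comes out on top for larger data.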
%load_ext Cython
%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]
pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))
from timeit import timeit
res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)
for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)
res.plot(loglog=True)
res.div(res.min(1), 0)
            pir1       pir2       cld1       cld2       jez1       div1       div2
10      1.243998   1.137877   1.006501   1.000000   1.798684   1.277133   1.427025
30      1.009771   1.144892   1.012283   1.000000   2.144972   1.210803   1.283230
100     1.090170   1.567300   1.039085   1.000000   3.134154   1.281968   1.356706
300     1.061804   2.260091   1.072633   1.000000   4.792343   1.051886   1.305122
1000    1.135483   3.401408   1.120250   1.033484   7.678876   1.077430   1.000000
3000    1.310274   5.179131   1.359795   1.362273  13.006764   1.317411   1.000000
10000   2.110001   7.861251   1.942805   1.696498  17.905551   1.974627   1.000000
30000   2.188024   8.236724   2.100529   1.872661  18.416222   1.875299   1.000000
def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(N+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
import numexpr as ne
def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(N+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
cs9*_*s95 16
When all else fails, use a list comprehension:
df['NewColumn'] = ['str_%s' %i for i in range(1, len(df) + 1)]
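The same comprehension can be wrapped in a small helper when the prefix or the starting number varies. This is only a sketch: `make_labels` and its parameters are hypothetical names, not part of this answer, and the frame is a copy of the question's sample data.

```python
import pandas as pd

def make_labels(prefix, n, start=1):
    """Return n labels: prefix + str(start), ..., prefix + str(start + n - 1)."""
    return [f"{prefix}{i}" for i in range(start, start + n)]

# Sample frame from the question
df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})
df["NewColumn"] = make_labels("str_", len(df))
print(df["NewColumn"].tolist())  # ['str_1', 'str_2', 'str_3']
```

Assigning a plain list keeps the labels positional, so it does not depend on the frame's index.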
You can speed it up further if you cythonize the function:
%load_ext Cython
%%cython
def gen_list(l, h):
    return ['str_%s' %i for i in range(l, h)]
Note that this code was run on Python 3.6.0 (IPython 6.2.1). The solution was improved thanks to @hpaulj in the comments.
# @jezrael's fastest solution
%%timeit
df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].map(str)
547 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# in this post - no cython
%timeit df['NewColumn'] = ['str_%s'%i for i in range(n)]
409 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# cythonized list comp
%timeit df['NewColumn'] = gen_list(1, len(df) + 1)
370 ms ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Div*_*kar 14
After a lot of tinkering with the string and numeric dtypes, and leveraging the easy interoperability between them, here is what I ended up with to get zero-padded strings, since NumPy handles these well and allows vectorized operations that way:
def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()
Bringing in numexpr for faster broadcasting and modulus operations:
import numexpr as ne
def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)
    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()
So, to use it as the new column:
df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)
Sample runs:
In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]:
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'],
dtype='|S6')
In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]:
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
'str_121', 'str_122', 'str_123'],
dtype='|S7')
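Basic idea and explanation with a step-by-step sample run
The basic idea is to create an array of ASCII-equivalent numeric codes that can be viewed as, or converted to, a string array through a dtype conversion. More specifically, we create numerals of type uint8. Each string is thus represented by a 1D array of numerals, and a list of strings becomes a 2D array of numerals in which each row (a 1D array) represents a single string.
1) Inputs: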
In [22]: prefix_str='str_'
...: start=15
...: stop=24
2) Parameters:
In [23]: N = stop - start # count of numbers
...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
In [24]: N,W
Out[24]: (9, 2)
3) Create a 1D array of numerals representing the starting string:
In [25]: padv = np.full(W,48,dtype=np.uint8)
...: a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
In [27]: a0
Out[27]: array([115, 116, 114, 95, 48, 48], dtype=uint8)
4) Extend it to cover the whole range of strings as a 2D array:
In [33]: a1 = np.repeat(a0[None],N,axis=0)
...: r = np.arange(start, stop)
...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)
In [34]: a1
Out[34]:
array([[115, 116, 114, 95, 49, 53],
[115, 116, 114, 95, 49, 54],
[115, 116, 114, 95, 49, 55],
[115, 116, 114, 95, 49, 56],
[115, 116, 114, 95, 49, 57],
[115, 116, 114, 95, 50, 48],
[115, 116, 114, 95, 50, 49],
[115, 116, 114, 95, 50, 50],
[115, 116, 114, 95, 50, 51]], dtype=uint8)
5) Each row now holds the ASCII equivalent of one string of the desired output. Let's get it with the final step:
In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]:
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
'str_21', 'str_22', 'str_23'],
dtype='|S6')
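Steps 4 and 5 can be replayed on a tiny range to see the digit extraction and the final view in isolation. This is a minimal sketch rather than the answer's own code: the names `powers`, `digits`, `codes`, `as_bytes` and `as_text` are illustrative, and the closing `astype` decode is an assumption for Python 3, where the 'S' view yields bytes rather than text.

```python
import numpy as np

prefix_str, start, stop, W = 'str_', 15, 18, 2

r = np.arange(start, stop)                    # [15, 16, 17]
powers = 10 ** np.arange(W - 1, -1, -1)       # [10, 1]
digits = (r[:, None] // powers) % 10          # one row of decimal digits per number

# ASCII codes: the prefix characters followed by '0' (48) plus each digit
prefix_codes = np.frombuffer(prefix_str.encode('ascii'), dtype=np.uint8)
codes = np.empty((len(r), len(prefix_str) + W), dtype=np.uint8)
codes[:, :len(prefix_str)] = prefix_codes
codes[:, len(prefix_str):] = 48 + digits

# View each row of bytes as one fixed-width byte string, then decode to text
as_bytes = codes.view('S%d' % codes.shape[1]).ravel()
as_text = as_bytes.astype('U%d' % codes.shape[1])
print(as_text)  # ['str_15' 'str_16' 'str_17']
```

The broadcast `r[:, None] // powers` divides every number by every power of ten at once, which is what removes the Python-level loop.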
Here's a quick timing test against the list comprehension version, which looked like the best performer judging by the timings in the other posts:
In [339]: N = 10000
In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop
In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop
In [342]: N = 100000
In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop
In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop
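The `%timeit` lines above need IPython. Outside of it, a comparable measurement can be sketched with the standard-library `timeit` module. The function names here are hypothetical, the `np.char.add` variant is borrowed from another answer in this thread rather than from this one, and no timings are claimed since they depend on the machine and library versions.

```python
import timeit
import numpy as np

N = 10_000

def with_comprehension(n=N):
    # Plain Python list comprehension
    return ['str_%s' % i for i in range(1, n + 1)]

def with_char_add(n=N):
    # Vectorized string concatenation via NumPy's char module
    return np.char.add('str_', np.arange(1, n + 1).astype(str))

for fn in (with_comprehension, with_char_add):
    # Best of 3 repeats, 20 calls each, reported per call
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```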
On Python 3, to get a string dtype array we need to pad the intermediate int dtype array with a few more zeros. So the Python 3 equivalents, without and with numexpr, ended up along these lines:
Approach #1 (no numexpr):
def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
Approach #2 (with numexpr):
import numexpr as ne
def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
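The extra zero padding in both Python 3 versions follows from how NumPy stores 'U' strings: four bytes per character (UTF-32). Here is a minimal sketch of just that step, with illustrative names (`codes`, `padded`, `labels`) and assuming a little-endian layout, which is made explicit with `'<U'`:

```python
import numpy as np

# ASCII codes for 'str_15' and 'str_16', one row per string
codes = np.array([[115, 116, 114, 95, 49, 53],
                  [115, 116, 114, 95, 49, 54]], dtype=np.uint8)
n_chars = codes.shape[1]

# Each character needs 4 bytes: ASCII code in the low byte, the other three stay 0
padded = np.zeros((codes.size, 4), dtype=np.uint8)
padded[:, 0] = codes.ravel()

# Reinterpret the raw bytes as fixed-width little-endian unicode strings
labels = np.frombuffer(padded.tobytes(), dtype='<U%d' % n_chars)
print(labels)  # ['str_15' 'str_16']
```

Note that `np.fromstring` in binary mode, as used in the functions above, has been deprecated since NumPy 1.14; `np.frombuffer(prefix_str.encode('ascii'), dtype=np.uint8)` is the usual replacement on recent versions.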
Timings:
In [8]: N = 100000
In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop
In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop
One possible solution is to convert the values to strings with map:
df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].map(str)
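As a quick check on the question's sample frame, the sketch below confirms that this produces the expected labels and that `.map(str)` and `.astype(str)` give the same values. The variable names (`counter`, `via_map`, `via_astype`) are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'id': [1, 2, 3], 'Field': ['A', 'B', 'D'], 'Value': [1, 0, 1]})

# 1-based counter aligned with the frame's index
counter = pd.Series(np.arange(len(df)) + 1, index=df.index)
via_map = 'str_' + counter.map(str)
via_astype = 'str_' + counter.astype(str)

df['New_Column'] = via_map
print(df['New_Column'].tolist())  # ['str_1', 'str_2', 'str_3']
```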