Length of each string in a NumPy array

Ped*_*roA 11 python numpy python-3.x

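Is there any built-in operation in NumPy that returns the length of each string in an array?

I don't think any of the NumPy string operations does that; is this correct?

I can do it with a for loop, but maybe there is something more efficient?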

import numpy as np
arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

sizes = []
for i in arr:
    sizes.append(len(i))

print(sizes)
[5, 3, 3, 10]

Jay*_*ikh 15

You can use NumPy's `vectorize`. It is much faster on large arrays.

mylen = np.vectorize(len)
print(mylen(arr))

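  • In my timings, `mylen` is noticeably slower than a list comprehension for this small example array, while for an array 1000x larger it is considerably faster. `vectorize` does not guarantee speed. It does make it easier to iterate over all elements of a multidimensional array. (5 upvotes)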
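A minimal sketch of the point about multidimensional arrays, assuming a small 2-D example (`grid` is a hypothetical name, not from the original post): `np.vectorize(len)` returns the lengths in the same shape as the input, which a flat list comprehension would only do with extra reshaping.

```python
import numpy as np

# Hypothetical 2-D array of strings, for illustration only.
grid = np.array([['Hello', 'foo'], ['and', 'whatsoever']])

mylen = np.vectorize(len)
lengths = mylen(grid)  # same shape as grid, one length per element

print(lengths.shape)
print(lengths.tolist())
```

This says nothing about speed; as the comment notes, `vectorize` is a convenience wrapper, and its performance depends on the array size.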

Pau*_*zer 7

Here is a comparison of several methods.

Observations:

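  • For input sizes greater than 100 rows, viewcasting + argmin is consistently the fastest, by a large margin.
  • The pure-Python solutions benefit from converting the array to a list first.
  • map beats the list comprehension.
  • np.frompyfunc and, to a lesser extent, np.vectorize fare better than their reputation suggests.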

Timings are in milliseconds per call, averaged over 10 runs:

method ↓↓                  size →→  |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.005|  0.036|  0.313|  3.170| 30.698|309.058
list comprehension                  |  0.005|  0.029|  0.283|  2.812| 29.588|273.618
list comprehension after .tolist()  |  0.002|  0.011|  0.109|  1.155| 12.888|133.759
map                                 |  0.002|  0.008|  0.074|  0.825|  9.386|103.074
np.frompyfunc                       |  0.004|  0.010|  0.081|  0.892|  7.985| 81.841
np.vectorize                        |  0.024|  0.030|  0.115|  1.070| 11.557|124.228
viewcast after zero padding         |  0.005|  0.006|  0.034|  0.298|  3.379| 35.487
viewcast                            |  0.010|  0.011|  0.037|  0.280|  2.886| 32.954

Code:

import numpy as np

flist = []
def timeme(name):
    # Decorator factory: register each benchmarked function under a display name.
    def wrap_gen(f):
        flist.append((name, f))
        return f
    return wrap_gen

@timeme("np.char.str_len")
def np_char():
    return np.char.str_len(A)

@timeme("list comprehension")
def lst_cmp():
    return [len(a) for a in A]

@timeme("list comprehension after .tolist()")
def lst_cmp_opt():
    return [len(a) for a in A.tolist()]

@timeme("map")
def map_():
    return list(map(len, A.tolist()))

@timeme("np.frompyfunc")
def np_fpf():
    return np.frompyfunc(len, 1, 1)(A)

@timeme("np.vectorize")
def np_vect():
    return np.vectorize(len)(A)

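@timeme("viewcast after zero padding")
def np_zt():
    # Pad each string by one character so every row contains a zero code
    # point, then take the index of the first zero as the length.
    N = A.dtype.itemsize//4
    return A.astype(f'U{N+1}').view(np.uint32).reshape(-1, N+1).argmin(1)

@timeme("viewcast")
def np_view():
    # View the Unicode array as one uint32 code point per character slot.
    v = A.view(np.uint32).reshape(A.size, -1)
    l = np.argmin(v, 1)
    # Rows without a zero are full-width strings; argmin points at a nonzero
    # entry there, so set their length to the row width.
    l[v[np.arange(len(v)), l] > 0] = v.shape[-1]
    return l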

A = np.random.choice(
    "Blindtext do not use the quick brown fox jumps over the lazy dog".split(),
    1000000)

for _, f in flist[:-1]:
    assert (f()==flist[-1][1]()).all()

from timeit import timeit

L = ['|+' + len(flist)*'|',
     [f"{'method ↓↓                  size →→':36s}", 36*'-']
     + [f"{name:36s}" for name, f in flist]]
for N in (10, 100, 1000, 10000, 100000, 1000000):
    A = np.random.choice("Blindtext do not use the quick brown fox jumps"
                         " over the lazy dog".split(), N)
    L.append([f"{N:>7d}", 7*'-']
             + [f"{timeit(f, number=10)*100:7.3f}" for name, f in flist])
for sep, *line in zip(*L):
    print(*line, sep=sep)
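To see why the viewcast approach works, here is a small worked sketch on the strings from the question, stored as a Unicode (`U`) array rather than `S256`. That dtype is an assumption: the `np.uint32` view relies on NumPy storing each Unicode character as one 4-byte code point, padded with zeros. The names `words`, `codes` and `full` are illustrative only.

```python
import numpy as np

words = np.array(['Hello', 'foo', 'and', 'whatsoever'])  # Unicode dtype, width 10

# One row per string, one uint32 code point per character slot.
codes = words.view(np.uint32).reshape(words.size, -1)

# The first zero in a row marks the end of the string.
lengths = np.argmin(codes, axis=1)

# A string that fills its row has no zero, so argmin points at a nonzero
# character instead; such rows get the full row width.
full = codes[np.arange(len(codes)), lengths] > 0
lengths[full] = codes.shape[1]

print(lengths.tolist())
```

The sketch assumes the strings contain no embedded null characters, since a zero code point inside a string would be mistaken for its end.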


小智 6

For me, this is the way to go:

sizes = [len(i) for i in arr]
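A variant of the same idea, sketched with the array from the question: calling `.tolist()` first hands plain Python `bytes` objects to `len` instead of NumPy scalars, which the benchmark in the comparison answer found to be faster. The name `sizes_from_list` is illustrative.

```python
import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

# Iterating the array directly yields NumPy scalars (np.bytes_).
sizes = [len(i) for i in arr]

# Converting first yields plain Python bytes objects.
sizes_from_list = [len(s) for s in arr.tolist()]

print(sizes, sizes_from_list)
```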


小智 6

Use `str_len` from NumPy:

sizes = np.char.str_len(arr)

str_len documentation: https://numpy.org/devdocs/reference/generated/numpy.char.str_len.html
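As a short sketch, `np.char.str_len` applied to the array from the question returns an integer array. It accepts both byte-string (`S`) and Unicode (`U`) arrays; the `astype('U256')` conversion below is added only to illustrate that.

```python
import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

sizes = np.char.str_len(arr)  # ndarray of integer lengths
print(sizes.tolist())

# The same call works on a Unicode array.
sizes_u = np.char.str_len(arr.astype('U256'))
print(sizes_u.tolist())
```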