Fastest way to grow a numpy numeric array

Question

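Requirements:

I need to grow an array arbitrarily large from data.
I can guess the size (roughly 100-200), but there is no guarantee that the array will fit every time.
Once it has grown to its final size, I need to perform numeric computations on it, so I would prefer to end up with a 2-D numpy array.
Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (about 150 s) and the finalize() method is called 500k times (106 s in total), for a total of 250 s or so.

Here is my code: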

def __init__(self):
    self.data = []

def update(self, row):
    self.data.append(row)

def finalize(self):
    dx = np.array(self.data)

Other things I tried include the following code, but it is waaaaay slower.

class A:
    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        # np.append returns a new array; it does not modify self.data in place
        self.data = np.append(self.data, row)

    def finalize(self):
        dx = np.reshape(self.data, (self.data.shape[0] // 5, 5))

Here is a schematic of how this is called:

for i in range(500000):
    ax = A()
    for j in range(200):
         ax.update([1,2,3,4,5])
    ax.finalize()
    # some processing on ax

Answer 1

Owe*_*wen 77

I tried a few different things and timed them.

import numpy as np

The method you mention as slow: (32.094 seconds)

class A:

    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        self.data = np.append(self.data, row)

    def finalize(self):
        return np.reshape(self.data, (self.data.shape[0] // 5, 5))

Regular ol' Python list: (0.308 seconds)

class B:

    def __init__(self):
        self.data = []

    def update(self, row):
        for r in row:
            self.data.append(r)

    def finalize(self):
        return np.reshape(self.data, (len(self.data) // 5, 5))

Trying to implement an arraylist in numpy: (0.362 seconds)

class C:

    def __init__(self):
        self.data = np.zeros((100,))
        self.capacity = 100
        self.size = 0

    def update(self, row):
        for r in row:
            self.add(r)

    def add(self, x):
        if self.size == self.capacity:
            self.capacity *= 4
            newdata = np.zeros((self.capacity,))
            newdata[:self.size] = self.data
            self.data = newdata

        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        data = self.data[:self.size]
        return np.reshape(data, (len(data) // 5, 5))

And this is how I timed it:

x = C()
for i in range(100000):
    x.update([i])

So it looks like regular old Python lists are pretty good ;)

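I think the comparison would be clearer with 60M update calls and 500K finalize calls. It looks like you never call finalize in this example. (2 upvotes)
@fodon I actually did call finalize, once per run (so I guess it had little impact). But this makes me think I may have misunderstood how your data grows: if you get 60M on an update, I assumed that would provide at least 60M data for the next finalize? (2 upvotes)
Note that the third option is superior when you are running out of memory. The second option requires a lot of memory. The reason is that Python's lists are arrays of references to values, whereas NumPy's arrays are actual arrays of values. (2 upvotes)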
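
The memory point in the last comment can be checked with a rough sketch like the one below. It assumes CPython, where each float is a separate object and `sys.getsizeof` reports per-object sizes; the variable names are illustrative only, and the exact byte counts depend on the interpreter and platform.

```python
import sys
import numpy as np

n = 100_000
values = [float(i) for i in range(n)]   # Python list of float objects
arr = np.array(values)                  # contiguous float64 buffer

# A list stores one pointer per element, plus a separate float object each.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy float64 array stores the raw 8-byte values back to back.
array_bytes = arr.nbytes

print("list  :", list_bytes, "bytes")
print("array :", array_bytes, "bytes")
```

On a typical 64-bit CPython build the list figure comes out several times larger than the array figure, which matches the comment's explanation.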

Answer 2

HYR*_*YRY 18

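np.append() copies all the data in the array every time, whereas a list grows its capacity by a factor (1.125). A list is fast, but its memory usage is larger than an array's. If you care about memory, you can use the array module from the Python standard library.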
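
As a minimal sketch of the array-module suggestion, assuming rows of five double-precision values as in the question (`ROW_LENGTH` and the variable names are illustrative, not from the original post):

```python
from array import array
import numpy as np

ROW_LENGTH = 5              # assumed row width, as in the question's example
buf = array('d')            # typed, compact buffer of C doubles

for j in range(200):
    # extend() appends the values in place; the buffer over-allocates as it grows
    buf.extend([1.0, 2.0, 3.0, 4.0, 5.0])

# np.frombuffer views the array's memory without copying; .copy() detaches
# the result so that buf can still be resized afterwards.
dx = np.frombuffer(buf, dtype=np.float64).reshape(-1, ROW_LENGTH).copy()
print(dx.shape)
```

The values are stored as raw 8-byte doubles rather than as references to Python objects, which is where the memory saving over a plain list would come from.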

Here is a discussion about this topic:

How to create a dynamic array

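Is there a way to change the factor by which the list grows? (2 upvotes)
^ Linear (i.e., the total accumulated time is quadratic), not exponential. (2 upvotes)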
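
The linear-versus-quadratic remark can be made concrete by counting element copies. The following is a back-of-the-envelope sketch under stated assumptions: one element is appended per call, n elements are appended in total, and the geometric case uses an initial capacity c_0 and the growth factor 4 from class C in Answer 1 (not the list's own factor).

```latex
% Repeated np.append: call k copies the k elements already stored
\[
  \sum_{k=0}^{n-1} k \;=\; \frac{n(n-1)}{2} \;=\; O(n^2)
\]
% Geometric growth: resize j copies c_0 4^j elements, j = 0, ..., m-1,
% where m is the smallest integer with c_0 4^m >= n
\[
  \sum_{j=0}^{m-1} c_0\,4^{j} \;=\; \frac{c_0\,(4^{m}-1)}{3} \;<\; \frac{4n}{3} \;=\; O(n)
\]
```

So each np.append call costs time linear in the current size and the total is quadratic, while geometric over-allocation keeps the total copying linear in n.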

Answer 3

Pra*_*mar 11

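Using the class declarations from Owen's post, here is a revised timing that includes some effect of finalize.

In short, I find that class C provides an implementation that is over 60x faster than the method in the original post. (Apologies for the wall of text.)

The file I used: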

#!/usr/bin/python
import cProfile
import numpy as np

# ... class declarations here ...

def test_class(f):
    x = f()
    for i in range(100000):
        x.update([i])
    for i in range(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

     903005 function calls in 16.049 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000   16.049   16.049 <string>:1(<module>)
100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
  1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
     1    0.146    0.146   16.049   16.049 test.py:50(test_class)
     1    0.000    0.000    0.000    0.000 test.py:6(__init__)
100000    1.475    0.000   15.899    0.000 test.py:9(update)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}


     208004 function calls in 16.885 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.001    0.001   16.885   16.885 <string>:1(<module>)
  1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
  1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
  1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
     1    0.000    0.000    0.000    0.000 test.py:16(__init__)
100000    0.068    0.000    0.080    0.000 test.py:19(update)
  1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
     1    0.284    0.284   16.883   16.883 test.py:50(test_class)
  1000    0.005    0.000    0.005    0.000 {getattr}
  1000    0.001    0.000    0.001    0.000 {len}
100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
  1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}


     204010 function calls in 0.244 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.244    0.244 <string>:1(<module>)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
     1    0.000    0.000    0.000    0.000 test.py:27(__init__)
100000    0.082    0.000    0.170    0.000 test.py:32(update)
100000    0.087    0.000    0.088    0.000 test.py:36(add)
  1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
     1    0.068    0.068    0.243    0.243 test.py:50(test_class)
  1000    0.000    0.000    0.000    0.000 {len}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
     6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A is destroyed by the updates, and class B is destroyed by the finalizes. Class C is robust in the face of both.

Answer 4

Luc*_*chi 5

There is a big performance difference depending on which function you use for finalization. Consider the following code:

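import numpy as np
from time import time

N = 100000
nruns = 5

# Build a list of N rows, each a 1-D array of length 1000
a = []
for i in range(N):
    a.append(np.zeros(1000))

print("start")

# Version 1: np.vstack
b = []
for i in range(nruns):
    s = time()
    c = np.vstack(a)
    b.append(time() - s)
print("Timing version vstack ", np.mean(b))

# Version 2: np.reshape applied directly to the list of arrays
b = []
for i in range(nruns):
    s = time()
    c1 = np.reshape(a, (N, 1000))
    b.append(time() - s)

print("Timing version reshape ", np.mean(b))

# Version 3: np.concatenate into one flat array, then reshape
b = []
for i in range(nruns):
    s = time()
    c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
    b.append(time() - s)

print("Timing version concatenate ", np.mean(b))

# All three versions must produce the same array
print(c.shape, c2.shape)
assert (c == c2).all()
assert (c == c1).all()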

Using concatenate seems to be about twice as fast as the first version and more than 10 times faster than the second version.

Timing version vstack  1.5774928093
Timing version reshape  9.67419199944
Timing version concatenate  0.669512557983
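
For a quick check without allocating roughly 800 MB, a scaled-down sketch of the same comparison might look like this. The sizes and the use of `timeit` are assumptions for illustration; relative timings vary with the NumPy version and the machine, so no expected numbers are given.

```python
import timeit
import numpy as np

n_rows, row_len = 1000, 8   # deliberately small, illustrative sizes
rows = [np.arange(row_len, dtype=float) + i for i in range(n_rows)]

# Three ways to turn a list of 1-D arrays into one 2-D array
stacked = np.vstack(rows)
reshaped = np.reshape(rows, (n_rows, row_len))
concatenated = np.concatenate(rows, axis=0).reshape(-1, row_len)

candidates = [
    ("vstack", lambda: np.vstack(rows)),
    ("reshape", lambda: np.reshape(rows, (n_rows, row_len))),
    ("concatenate", lambda: np.concatenate(rows, axis=0).reshape(-1, row_len)),
]
for name, fn in candidates:
    print(name, timeit.timeit(fn, number=20))
```

All three calls build the same array, so the choice between them is purely about speed.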

Answer 5

kho*_*kho 5

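Multi-dimensional Numpy arrays

Adding to Owen's and Prashant Kumar's answers, here is a version using multi-dimensional numpy arrays (i.e., a shape argument) that speeds up the numpy solutions. It is especially useful if you need to access the data (finalize()) often.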

Version	Prashant Kumar	row_length=1	row_length=5
Class A - np.append	2.873 s	2.776 s	0.682 s
Class B - python list	6.693 s	80.868 s	22.012 s
Class C - arraylist	0.095 s	0.180 s	0.043 s

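The column Prashant Kumar is his example executed on my machine, for comparison. row_length=5 corresponds to the example in the original question. The dramatic increase for the python list comes from {built-in method numpy.array}: numpy needs much more time to convert a multi-dimensional list of lists to an array than to convert a 1-D list and reshape it, even when both have the same number of entries, e.g. np.array([[1,2,3]]*5) vs. np.array([1]*15).reshape((-1,3)).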
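
Following that observation, one way to keep a list-based buffer while avoiding the list-of-lists conversion is to store the values flat and reshape only in finalize(). This is a sketch only: `FlatListBuffer` is a hypothetical name, not one of the classes timed in the table, and whether it is actually faster should be measured on your own NumPy version.

```python
import numpy as np

class FlatListBuffer:
    """Hypothetical variant of class B that keeps a flat 1-D Python list."""

    def __init__(self, row_length, dtype=float):
        self.row_length = row_length
        self.dtype = dtype
        self.data = []

    def update(self, row):
        # extend() adds the row's values individually, so the list stays 1-D
        self.data.extend(row)

    def finalize(self):
        # Convert the flat list once, then reshape into rows
        return np.array(self.data, dtype=self.dtype).reshape(-1, self.row_length)


buf = FlatListBuffer(row_length=5)
for i in range(200):
    buf.update([i] * 5)
result = buf.finalize()
print(result.shape)
```

This mirrors the np.array([1]*15).reshape((-1,3)) form from the paragraph above rather than the nested-list form.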

Here is the code used for the timings in the table above:

import cProfile
import numpy as np

class A:
    def __init__(self,shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.data = np.array([], dtype=dtype).reshape((0,*shape[1:]))

    def update(self, row):
        # axis=0 appends one row; without axis, np.append flattens to 1-D
        self.data = np.append(self.data, [row], axis=0)

    def finalize(self):
        return self.data
    
    
class B:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.dtype = dtype 
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        return np.array(self.data, dtype=self.dtype).reshape((-1, *self.shape[1:]))
    
    
class C:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.data = np.zeros((100,*shape[1:]),dtype=dtype)
        self.capacity = 100
        self.size = 0

    def update(self, x):
        if self.size == self.capacity:
            self.capacity *= 4
            newdata = np.zeros((self.capacity,*self.data.shape[1:]), dtype=self.data.dtype)
            newdata[:self.size] = self.data
            self.data = newdata

        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        return self.data[:self.size]
    

def test_class(f):
    row_length = 5
    x = f(shape=(0,row_length))
    for i in range(int(100000/row_length)):
        x.update([i]*row_length)
    for i in range(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

There is another option to add to Luca Fiaschi's post above.

b=[]
for i in range(nruns):
    s=time.time()
    c1=np.array(a, dtype=int).reshape((N,1000))
    b.append((time.time()-s))
    
print("Timing version array.reshape ",np.mean(b))

My timing results are:

Timing version vstack         0.6863266944885253
Timing version reshape        0.505419111251831
Timing version array.reshape  0.5052066326141358
Timing version concatenate    0.5339600563049316

Archived:	14 years, 3 months ago
Viewed:	62810 times
Last active:	9 years, 8 months ago