Memory-efficient Python batch processing

Question


use*_*929 5 python memory-management numpy memory-profiling spectral

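Question

I wrote a small Python batch processor that loads binary data, performs NumPy operations and stores the results. It consumes much more memory than it should. I have looked at similar Stack Overflow discussions and would like to ask for further recommendations.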

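Background

I convert spectral data to RGB. The spectral data is stored in a band-interleaved-by-line (BIL) image file, which is why I read and process the data line by line. I read the data with the Spectral Python library, which returns a NumPy array. hyp is the descriptor of a large spectral file: hyp.ncols = 1600, hyp.nrows = 3430, hyp.nbands = 160

Code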

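import spectral
import numpy as np
import scipy.interpolate  # "import scipy" alone does not reliably expose scipy.interpolate


class CIE_converter(object):
    def __init__(self, cie):
        # cie[:, 0] holds the target wavelengths, cie[:, 1:4] the three weighting columns
        self.cie = cie

    def interpolateBand_to_cie_range(self, hyp, hyp_line):
        # Resample one image line from the sensor band centers to the CIE wavelengths
        interp = scipy.interpolate.interp1d(hyp.bands.centers, hyp_line, kind='cubic',
                                            bounds_error=False, fill_value=0)
        return interp(self.cie[:, 0])

    #@profile
    def spectrum2xyz(self, hyp):
        out = np.zeros((hyp.ncols, hyp.nrows, 3))
        spec_line = hyp.read_subregion((0, 1), (0, hyp.ncols)).squeeze()
        spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
        for ii in xrange(hyp.nrows):  # Python 2; use range() on Python 3
            # Read one image line, resample it, and project it onto the three weighting columns
            spec_line = hyp.read_subregion((ii, ii + 1), (0, hyp.ncols)).squeeze()
            spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
            out[:, ii, :] = np.dot(spec_line_int, self.cie[:, 1:4])
        return out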


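Memory consumption

All the big data is initialised outside the loop. My naive interpretation was that the memory consumption should not increase inside it (have I used too much Matlab?). Can someone explain the increase by a factor of 10? It is not linear, since hyp.nrows = 3430. Are there any recommendations for improving the memory management?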

  Line #    Mem usage    Increment   Line Contents
  ================================================
  76                                 @profile
  77     60.53 MB      0.00 MB       def spectrum2xyz(self, hyp):
  78    186.14 MB    125.61 MB           out = np.zeros((hyp.ncols,hyp.nrows,3))
  79    186.64 MB      0.50 MB           spec_line = hyp.read_subregion((0,1), (0,hyp.ncols)).squeeze()
  80    199.50 MB     12.86 MB           spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  81                             
  82   2253.93 MB   2054.43 MB           for ii in xrange(hyp.nrows):
  83   2254.41 MB      0.49 MB               spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()
  84   2255.64 MB      1.22 MB               spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  85   2235.08 MB    -20.55 MB               out[:,ii,:] = np.dot(spec_line_int,self.cie[:,1:4])
  86   2235.08 MB      0.00 MB           return out
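
As a sanity check on the profile above, the increment reported for `out` can be reproduced by hand. This is only a sketch: it assumes the profiler's "MB" means MiB (2**20 bytes) and that `np.zeros` uses its default 8-byte float64. The `cube_mib` helper is hypothetical and takes the sample size as a parameter, because the post does not state the file's data type.

```python
# Dimensions quoted in the question
ncols, nrows, nbands = 1600, 3430, 160

MIB = 1024 ** 2  # assumption: the profiler's "MB" is 2**20 bytes

# out = np.zeros((ncols, nrows, 3)) holds float64 values, 8 bytes each
out_mib = ncols * nrows * 3 * 8 / float(MIB)
print(round(out_mib, 2))  # 125.61, the increment shown for the np.zeros line


def cube_mib(bytes_per_sample):
    """Size of the whole data cube in MiB for a given sample size (hypothetical helper)."""
    return nrows * nbands * ncols * bytes_per_sample / float(MIB)


# The sample size is NOT given in the post; 2 bytes is only an illustrative guess
print(round(cube_mib(2), 1))
```

The output array accounts for the first jump only. An increase of roughly 2 GB inside the loop is on the order of the whole data cube rather than of a single line.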


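Notes

I replaced range with xrange without any drastic improvement. I am aware that cubic interpolation is not the fastest option, but this question is not about CPU consumption.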

Answer 1

use*_*929 1

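Thanks for your comments. They all helped me to improve the memory consumption a little. But in the end I figured out the main reason for the memory consumption:

Spectral Python images contain a NumPy memmap object. It has the same layout as the hyperspectral data cube on disk (for the BIL format: (nrows, nbands, ncols)). When calling: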

spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()


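the image data is not only returned as a NumPy array, it is also cached in hyp.memmap. A second call will be faster, but in my case the memory just keeps growing until the OS complains. Since the memmap is actually a nice implementation, I will use it directly in future work.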
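
A minimal sketch of what working directly on a memmap could look like, using plain `numpy.memmap` on a small synthetic BIL file. This is not the Spectral Python API: the file name, the `int16` sample type, the tiny dimensions, the all-ones `weights` matrix and the `convert_lines` helper are all assumptions made for illustration, and the interpolation step from the question is left out.

```python
import os
import tempfile

import numpy as np

# Hypothetical small dimensions (the file in the question is 3430 x 160 x 1600)
nrows, nbands, ncols = 6, 5, 4

# Write a synthetic BIL file: for each image row, all bands, each band with ncols samples
cube = np.arange(nrows * nbands * ncols, dtype=np.int16).reshape(nrows, nbands, ncols)
path = os.path.join(tempfile.mkdtemp(), "cube.bil")
cube.tofile(path)

# Stand-in for the three CIE weighting columns (assumption, not real CIE data)
weights = np.ones((nbands, 3))


def convert_lines(path, shape, dtype, weights):
    """Process a BIL file one image line at a time through a read-only memmap."""
    nrows, nbands, ncols = shape
    out = np.zeros((ncols, nrows, 3))
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    for ii in range(nrows):
        # Copy a single (nbands, ncols) line into an ordinary in-memory array
        line = np.array(mm[ii], dtype=np.float64)
        # Project each pixel's spectrum onto the three weighting columns
        out[:, ii, :] = np.dot(line.T, weights)
    del mm  # drop this reference to the mapping once all lines are processed
    return out


out = convert_lines(path, (nrows, nbands, ncols), np.int16, weights)
print(out.shape)
```

Only one line is copied into an ordinary array per iteration. How much of the mapped file the operating system keeps resident, and how a profiler reports it, depends on the platform.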

Asked:	12 years, 11 months ago
Viewed:	1228 times
Last active:	12 years, 11 months ago