巨大数据集的Python双重自由错误

Avn*_*esh 31 python memory numpy scipy

我在Python中有一个非常简单的脚本,但由于某种原因,在运行大量数据时出现以下错误:

*** glibc detected *** python: double free or corruption (out): 0x00002af5a00cc010 ***
Run Code Online (Sandbox Code Playgroud)

当我试图释放已经释放的内存时,我习惯于在C或C++中出现这些错误.但是,通过我对Python的理解(特别是我编写代码的方式),我真的不明白为什么会这样.

这是代码:

#!/usr/bin/python -tt                                                                                                                                                                                                                         

import sys, commands, string
import numpy as np
import scipy.io as io
from time import clock

W = io.loadmat(sys.argv[1])['W']
size = W.shape[0]
numlabels = int(sys.argv[2])
Q = np.zeros((size, numlabels), dtype=np.double)
P = np.zeros((size, numlabels), dtype=np.double)
Q += 1.0 / Q.shape[1]
nu = 0.001
mu = 0.01
start = clock()
mat = -nu + mu*(W*(np.log(Q)-1))
end = clock()
print >> sys.stderr, "Time taken to compute matrix: %.2f seconds"%(end-start)
Run Code Online (Sandbox Code Playgroud)

有人可能会问,为什么要声明一个P和一个Q numpy数组?我只是这样做来反映实际情况(因为这段代码只是我实际做的一部分,我需要一个P矩阵并事先声明它).

我可以访问一台192GB的机器,所以我在一个非常大的SciPy稀疏矩阵上测试了这个(220万到220万,但非常稀疏,这不是问题).主存储器由Q,P和mat矩阵占用,因为它们都是2000万个2000矩阵(大小= 220万,numlabels = 2000).峰值内存高达131GB,非常适合内存.在计算mat矩阵时,我得到glibc错误,并且我的进程自动进入睡眠(S)状态,而不释放它占用的131GB.

鉴于奇怪的(对于Python)错误(我没有明确地解除分配任何东西),以及这对于较小的矩阵大小(2000年大约150万)很好地工作的事实,我真的不知道从哪里开始调试它.

作为一个起点,我在跑步前设置了"ulimit -s unlimited",但无济于事.

任何帮助或洞察numpy的行为与真正大量的数据将是受欢迎的.

请注意,这不是内存不足错误 - 我有196GB,我的进程达到131GB左右并保持一段时间,然后再给出错误.

更新:2013年2月16日(太平洋标准时间下午1:10):

根据建议,我使用GDB运行Python.有趣的是,在一次GDB运行中,我忘了将堆栈大小限制设置为"无限制",并得到以下输出:

*** glibc detected *** /usr/bin/python: munmap_chunk(): invalid pointer: 0x00007fe7508a9010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733b6)[0x7ffff6ec23b6]
/usr/lib64/python2.7/site-packages/numpy/core/multiarray.so(+0x4a496)[0x7ffff69fc496]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4e67)[0x7ffff7af48c7]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x309)[0x7ffff7af6c49]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7ffff7b25592]
/usr/lib64/libpython2.7.so.1.0(+0xfcc61)[0x7ffff7b33c61]
/usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x84)[0x7ffff7b34074]
/usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x189)[0x7ffff7b347c9]
/usr/lib64/libpython2.7.so.1.0(Py_Main+0x36c)[0x7ffff7b3e1bc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffff6e6dbfd]
/usr/bin/python[0x4006e9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 09:01 50336181                           /usr/bin/python2.7
00600000-00601000 r--p 00000000 09:01 50336181                           /usr/bin/python2.7
00601000-00602000 rw-p 00001000 09:01 50336181                           /usr/bin/python2.7
00602000-00e5f000 rw-p 00000000 00:00 0                                  [heap]
7fdf2584c000-7ffff0a66000 rw-p 00000000 00:00 0 
7ffff0a66000-7ffff0a6b000 r-xp 00000000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0a6b000-7ffff0c6a000 ---p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6a000-7ffff0c6b000 r--p 00004000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6b000-7ffff0c6c000 rw-p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6c000-7ffff0c77000 r-xp 00000000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0c77000-7ffff0e76000 ---p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e76000-7ffff0e77000 r--p 0000a000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e77000-7ffff0e78000 rw-p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e78000-7ffff0e79000 rw-p 00000000 00:00 0 
7ffff0e79000-7ffff0e9b000 r-xp 00000000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff0e9b000-7ffff109a000 ---p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109a000-7ffff109b000 r--p 00021000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109b000-7ffff109f000 rw-p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109f000-7ffff10a0000 rw-p 00000000 00:00 0 
7ffff10a0000-7ffff10a5000 r-xp 00000000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff10a5000-7ffff12a4000 ---p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a4000-7ffff12a5000 r--p 00004000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a5000-7ffff12a7000 rw-p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a7000-7ffff12ad000 r-xp 00000000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff12ad000-7ffff14ac000 ---p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ac000-7ffff14ad000 r--p 00005000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ad000-7ffff14ae000 rw-p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ae000-7ffff14b5000 r-xp 00000000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff14b5000-7ffff16b4000 ---p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b4000-7ffff16b5000 r--p 00006000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b5000-7ffff16b6000 rw-p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b6000-7ffff17c2000 r-xp 00000000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff17c2000-7ffff19c2000 ---p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c2000-7ffff19c3000 r--p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c3000-7ffff19c6000 rw-p 0010d000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c6000-7ffff19d5000 r-xp 00000000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff19d5000-7ffff1bd4000 ---p 0000f000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff1bd4000-7ffff1bd5000 r--p 0000e000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
Program received signal SIGABRT, Aborted.
0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
#1  0x00007ffff6e82fb6 in abort () from /lib64/libc.so.6
#2  0x00007ffff6ebcdd3 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ec23b6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69fc496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#5  0x00007ffff7af48c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#6  0x00007ffff7af6c49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#7  0x00007ffff7b25592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#8  0x00007ffff7b33c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#9  0x00007ffff7b34074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#10 0x00007ffff7b347c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007ffff7b3e1bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007ffff6e6dbfd in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004006e9 in _start ()
Run Code Online (Sandbox Code Playgroud)

当我将堆栈大小限制设置为无限制时,我得到以下内容:

*** glibc detected *** /usr/bin/python: double free or corruption (out): 0x00002abb2732c010 ***
^X^C
Program received signal SIGINT, Interrupt.
0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab969f2e in _L_lock_9927 () from /lib64/libc.so.6
#2  0x00002aaaab9682d1 in free () from /lib64/libc.so.6
#3  0x00002aaaaaabbfe2 in _dl_scope_free () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaab70a4 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#5  0x00002aaaaaabcaa0 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00002aaaaaabc5da in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00002aaaab9fb530 in do_dlopen () from /lib64/libc.so.6
#9  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00002aaaab9fb5cf in dlerror_run () from /lib64/libc.so.6
#11 0x00002aaaab9fb637 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00002aaaab9d60c5 in init () from /lib64/libc.so.6
#13 0x00002aaaab080933 in pthread_once () from /lib64/libpthread.so.0
#14 0x00002aaaab9d61bc in backtrace () from /lib64/libc.so.6
#15 0x00002aaaab95dde7 in __libc_message () from /lib64/libc.so.6
#16 0x00002aaaab9633b6 in malloc_printerr () from /lib64/libc.so.6
#17 0x00002aaaab9682dc in free () from /lib64/libc.so.6
#18 0x00002aaaabef1496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#19 0x00002aaaaad888c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#20 0x00002aaaaad8ac49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#21 0x00002aaaaadb9592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#22 0x00002aaaaadc7c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#23 0x00002aaaaadc8074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#24 0x00002aaaaadc87c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#25 0x00002aaaaadd21bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#26 0x00002aaaab90ebfd in __libc_start_main () from /lib64/libc.so.6
#27 0x00000000004006e9 in _start ()
Run Code Online (Sandbox Code Playgroud)

这让我相信基本的问题是numpy多阵列核心模块(第一个输出中的第4行和第二个中的第18行).为了以防万一,我会把它作为numpy和scipy的bug报告提出来.

谁看过这个吗?

更新:2013年2月17日(太平洋标准时间下午4:45)

我发现了一台可以运行代码的机器,它有更新版本的SciPy(0.11)和NumPy(1.7.0).直接运行代码(没有GDB)导致seg错误,没有任何输出到stdout或stderr.再次通过GDB运行,我得到以下内容:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaac5fcd04 in PyDataMem_FREE (ptr=<optimized out>, $K8=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:3510
#2  array_dealloc (self=0xc00ab7edbfc228fe) at numpy/core/src/multiarray/arrayobject.c:416
#3  0x0000000000498eac in PyEval_EvalFrameEx ()
#4  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#5  0x00000000004a9081 in PyRun_FileExFlags ()
#6  0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#7  0x00000000004aa8bd in Py_Main ()
#8  0x00002aaaabe4f76d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041b9b1 in _start ()
Run Code Online (Sandbox Code Playgroud)

我知道这不如用调试符号编译的NumPy有用,我会尝试这样做并稍后发布输出.

Avn*_*esh 6

在Numpy Github页面(https://github.com/numpy/numpy/issues/2995)上讨论了同样的问题后,我注意到Numpy/Scipy不会支持如此大量的非零值在得到的稀疏矩阵中.

基本上,W是稀疏矩阵,Q(或np.log(Q)-1)是密集矩阵.当将密集矩阵与稀疏矩阵相乘时,得到的乘积也将以稀疏矩阵形式表示(这很有意义).但请注意,由于我的W矩阵中没有零行,因此得到的乘积W*(np.log(Q)-1)nnz > 2^31(220万乘以2000),这超出了当前Scipy版本中稀疏矩阵中的最大元素数.

在这个阶段,我不确定如何让这个工作,除非用另一种语言重新实现.也许它仍然可以在Python中完成,但是编写C++和Eigen实现可能更好.

特别感谢光伏.帮助解决确切的问题,并感谢其他人的头脑风暴!