Why does copying a >= 16 GB Numpy array set all of its elements to 0?

1''*_*1'' 12 python numpy intel-mkl

On my Anaconda Python distribution, copying a Numpy array of 16 GB or larger (regardless of dtype) sets all elements of the copy to 0:

>>> np.arange(2 ** 31 - 1).copy()  # works fine
array([         0,          1,          2, ..., 2147483644, 2147483645,
       2147483646])
>>> np.arange(2 ** 31).copy()  # wait, what?!
array([0, 0, 0, ..., 0, 0, 0])
>>> np.arange(2 ** 32 - 1, dtype=np.float32).copy()
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
         4.29496730e+09,   4.29496730e+09,   4.29496730e+09], dtype=float32)
>>> np.arange(2 ** 32, dtype=np.float32).copy()
array([ 0.,  0.,  0., ...,  0.,  0.,  0.], dtype=float32)
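To make the "16 GB" threshold concrete, here is a small sketch that computes the sizes of the four arrays above without allocating them. It assumes `np.arange(n)` defaults to int64, as is typical on 64-bit Linux and macOS; the helper name `array_gib` is made up for this illustration.

```python
import numpy as np

GIB = 2 ** 30  # bytes in one gibibyte


def array_gib(n_elements, dtype):
    """Size in GiB of a 1-D array holding n_elements items of the given dtype."""
    return n_elements * np.dtype(dtype).itemsize / GIB


# Assumption: the integer arrays in the question are int64 (8 bytes per element).
cases = [
    (2 ** 31 - 1, np.int64),    # copy was correct in the question
    (2 ** 31, np.int64),        # copy came back as all zeros
    (2 ** 32 - 1, np.float32),  # copy was correct
    (2 ** 32, np.float32),      # copy came back as all zeros
]

for n, dtype in cases:
    print(f"{n:>10} x {np.dtype(dtype).name:<7} = {array_gib(n, dtype):.9f} GiB")
```

Under that assumption, both failing cases are exactly 2 ** 34 bytes (16 GiB), and both working cases fall just short of it.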

Here is the output of np.__config__.show() for this distribution:

blas_opt_info:
    library_dirs = ['/users/username/.anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/users/username/.anaconda3/include']
    libraries = ['mkl_rt', 'pthread']
lapack_opt_info:
    library_dirs = ['/users/username/.anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/users/username/.anaconda3/include']
    libraries = ['mkl_rt', 'pthread']
mkl_info:
    library_dirs = ['/users/username/.anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/users/username/.anaconda3/include']
    libraries = ['mkl_rt', 'pthread']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
    library_dirs = ['/users/username/.anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/users/username/.anaconda3/include']
    libraries = ['mkl_rt', 'pthread']
blas_mkl_info:
    library_dirs = ['/users/username/.anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/users/username/.anaconda3/include']
    libraries = ['mkl_rt', 'pthread']

For comparison, here is the output of np.__config__.show() for my system Python distribution, which does not have this problem:

blas_opt_info:
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
openblas_lapack_info:
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
openblas_info:
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
lapack_opt_info:
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
blas_mkl_info:
  NOT AVAILABLE

I wonder whether the MKL acceleration is the problem. I have reproduced this bug on both Python 2 and Python 3.
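For anyone who wants to compare two installations, a minimal check along these lines might help. It is only a sketch: the helper name `copy_is_faithful` is hypothetical, and running it at the failing size (2 ** 31 int64 elements) needs roughly 32 GB of RAM for the original plus the copy.

```python
import numpy as np


def copy_is_faithful(n_elements, dtype=np.int64):
    """Return True if .copy() reproduces np.arange(n_elements) exactly."""
    original = np.arange(n_elements, dtype=dtype)
    duplicate = original.copy()
    return bool(np.array_equal(original, duplicate))


# Small size shown here so the sketch runs anywhere; substitute 2 ** 31
# on a machine with enough memory to probe the threshold from the question.
print(copy_is_faithful(10 ** 6))
```

Running the same script under each Python distribution, alongside np.__config__.show(), would show whether the result changes with the linked BLAS library.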

MSe*_*ert 4

This is just a guess. I don't currently have any evidence to back up the following, but my guess is that this is a simple overflow problem:

>>> np.arange(2 ** 31 - 1).size
2147483647

This happens to be the maximum value of an int32:

>>> np.iinfo(np.int32)
iinfo(min=-2147483648, max=2147483647, dtype=int32)

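So when you actually have an array of size 2147483648 (2**31) and the size is stored in an int32, it overflows and becomes a negative value. The inside of the numpy.ndarray.copy method then presumably looks something like this: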

for (i = 0; i < size; i++) {
    newarray[i] = oldarray[i];
}

But given that the size is now negative, the loop will never execute, because 0 > -2147483648.
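The guess can be modelled in a few lines of Python. This is only a sketch of the hypothesis, not NumPy's actual copy code: `wrap_to_int32` and `copy_loop_iterations` are made-up names, and `ctypes.c_int32` is used because it wraps out-of-range values silently, the way a C cast would.

```python
import ctypes


def wrap_to_int32(n):
    """Reinterpret a Python int as a signed 32-bit C integer (wraps silently)."""
    return ctypes.c_int32(n).value


def copy_loop_iterations(true_size):
    """Number of times `for (i = 0; i < size; i++)` would run if the
    element count were stored in an int32 (hypothetical model of the loop)."""
    size = wrap_to_int32(true_size)
    return max(size, 0)  # a negative size means the loop body never runs


for n in (2 ** 31 - 1, 2 ** 31):
    print(n, "->", wrap_to_int32(n), "iterations:", copy_loop_iterations(n))
```

In this model, an element count of 2 ** 31 - 1 is preserved, while 2 ** 31 wraps to the most negative int32 and the loop copies nothing.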

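It is odd that the new array is actually initialized with zeros, because there is no point in writing zeros before copying the array (though it may be similar to this question).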

Again: this is just a guess at the moment, but it matches the behavior.