Split a numpy array into chunks by maximum size

jpm*_*c26 4 python numpy

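I have some very large two-dimensional numpy arrays. One data set is 55732 by 257659, which is over 14 billion elements. Because some operations I need to perform throw MemoryError, I want to try splitting the arrays into chunks of a certain size and running the operations against the chunks. (I can aggregate the results after the operation runs on each piece.) The fact that my problem is MemoryError means it is important that I can cap the size of the arrays somehow, rather than split them into a fixed number of pieces.

For example, let's generate a 1009 by 1009 random array: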

a = numpy.random.choice([1,2,3,4], (1009,1009))

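My data does not need to be split evenly, and it is definitely not guaranteed to be divisible by the size I want. That is why I chose 1009: it is prime.

Let's also say I want the chunks to be no larger than 50 by 50. Since this is only to avoid errors with extremely large arrays, it's okay if the result isn't exact.

How can I split this into the desired chunks?

I'm using Python 3.6 64-bit with numpy 1.14.3 (latest).

Related

I have seen a function that uses reshape, but it doesn't work if the number of rows and columns is not exactly divisible by the chunk size.

This question (and other similar ones) has answers that explain how to split into a certain number of chunks, but they do not explain how to split into chunks of a certain size.

I also saw this question, since it is actually my exact problem. The answer and comments suggest switching to 64-bit (which I already have) and using numpy.memmap. Neither helped.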

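This can be done so that the resulting arrays have shapes slightly smaller than the desired maximum, or so that they have exactly the desired maximum except for some remainder at the end.

The basic logic is to compute the parameters for splitting the array and then use array_split to split the array along each axis (or dimension) in turn.

We'll need the numpy and math modules and the example array: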

import math
import numpy

a = numpy.random.choice([1,2,3,4], (1009,1009))

Slightly less than max

Logic

First, store the maximum chunk size along each dimension in a tuple:

chunk_shape = (50, 50)

array_split only splits along one axis (or dimension) of an array at a time. So let's start with just the first axis.

Compute the number of sections we need to split the array into:
```
num_sections = math.ceil(a.shape[0] / chunk_shape[0])
```
In our example, this is 21 (1009 / 50 = 20.18, rounded up).
Now split it:
```
first_split = numpy.array_split(a, num_sections, axis=0)
```
This gives us a list of 21 numpy arrays (the number of sections requested), each no larger than 50 along the first dimension:
```
print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(48, 1009), (49, 1009)}
# These are the distinct shapes, so we don't see all 21 separately
```
In this case, they are either 48 or 49 long along that axis.
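
To see how array_split distributes the rows when the length does not divide evenly, the chunk lengths can be tallied. This is a minimal sketch that rebuilds the example array so it runs on its own; the Counter is only there for illustration:

```python
import collections
import math

import numpy

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
num_sections = math.ceil(a.shape[0] / 50)
first_split = numpy.array_split(a, num_sections, axis=0)

# Count how many chunks have each length along axis 0.
# 1009 = 21 * 48 + 1, so exactly one chunk gets the extra row.
sizes = collections.Counter(chunk.shape[0] for chunk in first_split)
print(sorted(sizes.items()))
# [(48, 20), (49, 1)]
```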

We can do the same thing to each new array for the second dimension:

num_sections = math.ceil(a.shape[1] / chunk_shape[1])
second_split = [numpy.array_split(a2, num_sections, axis=1) for a2 in first_split]

This gives us a list of lists. Each sublist contains numpy arrays of the size we wanted:

print(len(second_split))
# 21
print({len(i) for i in second_split})
# {21}
# All sublists are 21 long
print({i2.shape for i in second_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes

Full function

We can implement this for an arbitrary number of dimensions using a recursive function:

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

We call it like so:

full_split = split_to_approx_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes
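
Since the point of chunking is to run an operation per chunk and then aggregate the results, here is one way the nested list could be consumed. This is only a sketch: `iter_chunks` is a hypothetical helper (not part of the answer), and the sum stands in for whatever operation raises MemoryError on the full array. The splitting function is repeated so the snippet is self-contained.

```python
import math

import numpy


def split_to_approx_shape(a, chunk_shape, start_axis=0):
    # Same function as above, repeated so this snippet runs on its own.
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(s, chunk_shape, start_axis + 1) for s in split]


def iter_chunks(nested):
    # Hypothetical helper: walk the nested lists (one level per axis)
    # and yield the numpy arrays found at the bottom.
    if isinstance(nested, numpy.ndarray):
        yield nested
    else:
        for item in nested:
            yield from iter_chunks(item)


a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
full_split = split_to_approx_shape(a, (50, 50))

# Run the operation per chunk, then combine the partial results.
total = sum(int(chunk.sum()) for chunk in iter_chunks(full_split))
print(total == int(a.sum()))
# True
```

Because `iter_chunks` recurses until it reaches an array, the same loop should work for arrays with more than two axes.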

Exact shapes plus remainder

Logic

If we want to be a little fancier and have every new array be exactly the specified size, except for a trailing leftover array, we can do that by passing array_split a list of indices to split at.
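
As a small illustration of this second form of array_split, where the second argument is a list of indices rather than a section count, consider a made-up 10-element array:

```python
import numpy

small = numpy.arange(10)

# Split before indices 3, 6 and 9; whatever is left after the
# last index becomes the trailing remainder.
pieces = numpy.array_split(small, [3, 6, 9])
print([p.tolist() for p in pieces])
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```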

First build the list of indices:

axis = 0
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]

This gives us a list of indices, each 50 more than the previous one:

print(split_indices)
# [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]

Then split:

first_split = numpy.array_split(a, split_indices, axis=0)
print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(9, 1009), (50, 1009)}
# Distinct shapes, so we don't see all 21 separately
print((first_split[0].shape, first_split[1].shape, '...', first_split[-2].shape, first_split[-1].shape))
# ((50, 1009), (50, 1009), '...', (50, 1009), (9, 1009))

And then again for the second axis:

axis = 1
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
second_split = [numpy.array_split(a2, split_indices, axis=1) for a2 in first_split]
print({i2.shape for i in second_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}

Full function

Adapting the recursive function:

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it in exactly the same way:

full_split = split_to_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}
# Distinct shapes
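
One edge case that may be worth checking: when an axis length is an exact multiple of the chunk size, the last computed index equals the axis length, and array_split then returns an extra empty piece at the end. The variant below is a sketch of one way to avoid that, building the indices with range so the axis length itself is never included. `split_indices_for` is a hypothetical helper name, not something from the answer, and the 100 by 7 array is made up for the example.

```python
import math

import numpy


def split_indices_for(length, chunk):
    # Hypothetical helper: chunk, 2*chunk, ... strictly below length.
    return list(range(chunk, length, chunk))


b = numpy.zeros((100, 7))

# Indices computed as in the answer: the last one equals the axis length.
original = [50 * (i + 1) for i in range(math.floor(b.shape[0] / 50))]
original_pieces = numpy.array_split(b, original, axis=0)
print([p.shape[0] for p in original_pieces])
# [50, 50, 0]

# Indices from range(): no empty trailing piece.
adjusted = split_indices_for(b.shape[0], 50)
adjusted_pieces = numpy.array_split(b, adjusted, axis=0)
print([p.shape[0] for p in adjusted_pieces])
# [50, 50]
```

For the 1009-long axis used in the example, both ways of computing the indices give the same list, so the outputs shown above are unaffected.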

Extra Notes

Performance

These functions seem to be quite fast. With either function, I was able to split the test array below (roughly 14 billion elements) into 1000 by 1000 pieces (roughly 14000 new arrays) in under 0.05 seconds:

import timeit

print('Building test array')
a = numpy.random.randint(4, size=(55000, 250000), dtype='uint8')
chunks = (1000, 1000)
numtests = 1000
print('Running {} tests'.format(numtests))
print('split_to_approx_shape: {} seconds'.format(timeit.timeit(lambda: split_to_approx_shape(a, chunks), number=numtests) / numtests))
print('split_to_shape: {} seconds'.format(timeit.timeit(lambda: split_to_shape(a, chunks), number=numtests) / numtests))

Output:

Building test array
Running 1000 tests
split_to_approx_shape: 0.035109398348040485 seconds
split_to_shape: 0.03113800323300747 seconds

I did not test speed with higher-dimensional arrays.
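
A likely reason for the speed is that array_split slices the array rather than copying it, so the chunks are views onto the original data. This can be checked with numpy.shares_memory; a small sketch on a toy array:

```python
import numpy

a = numpy.zeros((1009, 1009), dtype='uint8')
chunks = numpy.array_split(a, 21, axis=0)

# The chunk shares its buffer with the original array.
print(numpy.shares_memory(a, chunks[0]))
# True

# Writing through the view changes the original.
chunks[0][0, 0] = 7
print(a[0, 0])
# 7
```

One consequence is that splitting does not by itself reduce memory use: the full array still has to exist, and the saving comes from running the expensive operation on one chunk at a time.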

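Shapes smaller than the max

Both functions work properly if the size of any dimension is less than the specified maximum. This requires no special logic.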
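
To illustrate, here is a sketch with a made-up 30 by 30 array and the same maximum of 50 along the first axis. Both computations reduce to a single piece:

```python
import math

import numpy

small = numpy.ones((30, 30))
chunk = 50

# Approximate version: ceil(30 / 50) == 1 section.
num_sections = math.ceil(small.shape[0] / chunk)
print(num_sections)
# 1

# Exact version: floor(30 / 50) == 0, so the index list is empty.
split_indices = [chunk * (i + 1) for i in range(math.floor(small.shape[0] / chunk))]
print(split_indices)
# []

# Either way array_split returns one piece with the full shape.
by_sections = numpy.array_split(small, num_sections, axis=0)
by_indices = numpy.array_split(small, split_indices, axis=0)
print([p.shape for p in by_sections])
# [(30, 30)]
print([p.shape for p in by_indices])
# [(30, 30)]
```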

Archived: 7 years, 6 months ago
Viewed: 2285 times
Last active: 6 years, 1 month ago
