How to build (or precompute) a histogram from a file too large for memory?

Tho*_*hew 2 python numpy matplotlib

Is there a plotting library for Python that does not require storing all raw data points as a numpy array or list in order to draw a histogram?

I have a dataset that is too large for memory, and I do not want to use subsampling to reduce the data size.

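What I am looking for is a library that can take the output of a generator (each data point yielded from a file, as a float) and build a histogram on the fly.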

This includes computing the bin size as the generator yields each data point from the file.

If such a library does not exist, I would like to know whether numpy can precompute a counter of the form {bin_1:count_1, bin_2:count_2...bin_x:count_x} from the yielded data points.

The data points are stored as a vertical matrix in a tab-delimited file, arranged as node-node-score like this:

node   node   5.55555

More info:

  • 104301133 lines of data (so far)
  • I do not know the min or max values
  • Bin widths should all be the same
  • The number of bins could be 1000

Attempted answer:

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

low = np.inf
high = -np.inf

# find the overall min/max and count the rows
chunksize = 1000
lines = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    lines += len(chunk)  # exact row count; loop*chunksize overcounts a partial last chunk

nbins = math.ceil(math.sqrt(lines))

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # int64 counts, matching what np.histogram returns


# iterate over your dataset in chunks of `chunksize` lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

    # accumulate bin counts over chunks
    total += subtotal


plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')

Output: normal distribution, mu = 0, sigma = 30

ali*_*i_m 7

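You could iterate over chunks of your dataset and use np.histogram to accumulate the bin counts into a single vector (you need to define the bin edges beforehand and pass them to np.histogram using the bins= parameter), for example: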

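import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# low, high and nbins must be chosen beforehand (see the min/max pass below)
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.uint)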

# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)

    # accumulate bin counts over chunks
    total += subtotal.astype(np.uint)
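The question asked for something that consumes a generator of floats. The following is a minimal sketch of the same accumulation idea without pandas, assuming the bin edges are already known. The names `histogram_from_generator`, `read_scores` and `batch_size` are illustrative and not part of NumPy:

```python
from itertools import islice

import numpy as np


def histogram_from_generator(values, bin_edges, batch_size=100_000):
    """Accumulate bin counts from an iterable of floats, one batch at a time."""
    bin_edges = np.asarray(bin_edges, dtype=float)
    total = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    iterator = iter(values)
    while True:
        # Pull at most batch_size values so memory use stays bounded
        batch = np.fromiter(islice(iterator, batch_size), dtype=float)
        if batch.size == 0:
            break
        counts, _ = np.histogram(batch, bins=bin_edges)
        total += counts
    return total


def read_scores(path):
    """Yield the third tab-separated field of each line as a float (hypothetical helper)."""
    with open(path) as handle:
        for line in handle:
            yield float(line.rstrip("\n").split("\t")[2])
```

It could be called as `histogram_from_generator(read_scores(path), bin_edges)`. Note that np.histogram silently ignores values that fall outside the given edges, so the edges should span the whole data range.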

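If you want to ensure that your bins span the full range of values in your data, but you do not yet know the minimum and maximum, you will need to loop over the data once beforehand to compute them (e.g. using np.min/np.max), for example: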

low = np.inf
high = -np.inf

# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
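Putting the two passes in order, a sketch of the whole procedure might look like this. It assumes a tab-delimited file with no header row and the score in the third column, as described in the question. The function name `two_pass_histogram` and its parameters are made up for illustration:

```python
import numpy as np
import pandas as pd


def two_pass_histogram(path, nbins=1000, chunksize=100_000, column=2):
    """Return (counts, bin_edges) for one column of a tab-delimited file."""
    low, high = np.inf, -np.inf

    # Pass 1: find the overall min/max without holding the file in memory
    for chunk in pd.read_csv(path, sep="\t", header=None, chunksize=chunksize):
        low = min(low, chunk.iloc[:, column].min())
        high = max(high, chunk.iloc[:, column].max())

    bin_edges = np.linspace(low, high, nbins + 1)
    counts = np.zeros(nbins, dtype=np.int64)

    # Pass 2: accumulate per-chunk bin counts into one vector
    for chunk in pd.read_csv(path, sep="\t", header=None, chunksize=chunksize):
        subtotal, _ = np.histogram(chunk.iloc[:, column], bins=bin_edges)
        counts += subtotal

    return counts, bin_edges
```

The file is read twice, which trades I/O time for a memory footprint of roughly one chunk plus the `nbins` counters.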

Once you have your array of bin counts, you can generate a bar plot directly using plt.bar:

# draw each bar from its left edge, using the actual bin widths
plt.bar(bin_edges[:-1], total, width=np.diff(bin_edges), align='edge')

It is also possible to use the weights= parameter of plt.hist to generate a histogram from a vector of counts rather than from samples, e.g.:

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)
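As a further option, assuming Matplotlib 3.4 or newer is available, `Axes.stairs` takes precomputed counts and their edges directly. The sketch below uses small stand-in arrays in place of the real `total` and `bin_edges`, and selects the non-interactive Agg backend only so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Stand-in values; in practice these come from the chunked accumulation
bin_edges = np.linspace(-3.0, 3.0, 7)
total = np.array([1, 4, 9, 9, 4, 1])

fig, ax = plt.subplots()
# stairs expects len(edges) == len(values) + 1, just like np.histogram output
patch = ax.stairs(total, bin_edges, fill=True)
ax.set_xlabel("score")
ax.set_ylabel("count")
```

Calling `fig.savefig(...)` afterwards writes the figure to disk, as in the question's code.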