Analyzing a 20 GB+ data frame with Pandas runs out of memory, even when chunksize is specified

use*_*130 5 python numpy out-of-memory pandas

I have the following code to analyze a huge data frame file (22 GB, over 2 million rows and 3K columns). I tested the code on a smaller data frame (head -1000 hugefile.txt) and it ran fine. However, when I run it on the huge data frame, it gives me a "Segmentation fault" core dump and writes out a core.number binary file.

I did some searching online and came up with using low_memory=False, and also tried reading the data frame with chunksize=1000, iterator=True and then pd.concat-ing the chunks, but this still gives me memory problems (core dump). It doesn't even get through reading the whole file before the core dump; I tested this by just reading the file and printing some text. Please help and let me know if there is a solution for analyzing this huge file.

Versions

python version: 3.6.2
numpy version: 1.13.1
pandas version: 0.20.3
OS: Linux/Unix

Script

#!/usr/bin/python
import pandas as pd
import numpy as np

path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False,chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)

#######

i=0
marker_keep = 0
marker_remove = 0
while(i<(data.shape[0])):
    j=5 #starts at 6
    missing = 0
    NoNmiss = 0
    while (j < (data.shape[1]-2)):
        if pd.isnull(data.iloc[i,j]) == True:
            missing = missing +1
            j= j+3
        elif ((data.iloc[i,j+1] >=10) & (((data.iloc[i,j+1])/(data.iloc[i,j+2])) > 0.5)):
            NoNmiss = NoNmiss +1
            j=j+3  
        else:
            missing = missing +1
            j= j+3       
    if (NoNmiss/(missing+NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else: 
        marker_remove = marker_remove +1
    i=i+1


a = str(marker_keep)
b= str(marker_remove)
c = "marker keep: " + a + "; marker remove: " +b
result = open('PyCount_marker_result.txt', 'w')
result.write(c) 
result.close()

Sample dataset:

Index   Group   Number1 Number2 DummyCol    sample1.NA  sample1.NA.score    sample1.NA.coverage sample2.NA  sample2.NA.score    sample2.NA.coverage sample3.NA  sample3.NA.score    sample3.NA.coverage
1   group1  13247   13249   Marker  CC  3   1   NA  0   0   NA  0   0
2   group1  13272   13274   Marker  GG  7   6   GG  3   1   GG  3   1
4   group1  13301   13303   Marker  CC  11  12  CC  5   4   CC  5   3
5   group1  13379   13381   Marker  CC  6   5   CC  5   4   CC  5   3
7   group1  13417   13419   Marker  GG  7   6   GG  4   2   GG  5   4
8   group1  13457   13459   Marker  CC  13  15  CC  9   9   CC  11  13
9   group1  13493   13495   Marker  AA  17  21  AA  11  12  AA  11  13
10  group1  13503   13505   Marker  GG  14  17  GG  9   10  GG  13  15
11  group1  13549   13551   Marker  GG  6   5   GG  4   2   GG  6   5
12  group1  13648   13650   Marker  NA  0   0   NA  0   0   NA  0   0
13  group1  13759   13761   Marker  NA  0   0   NA  0   0   NA  0   0
14  group1  13867   13869   Marker  NA  0   0   NA  0   0   NA  0   0
15  group1  13895   13897   Marker  CC  3   1   NA  0   0   NA  0   0
20  group1  14430   14432   Marker  GG  15  18  NA  0   0   GG  5   3
21  group1  14435   14437   Marker  GG  16  20  GG  3   1   GG  4   2
22  group1  14463   14465   Marker  AT  0   24  AA  3   1   TT  4   6
23  group1  14468   14470   Marker  CC  18  23  CC  3   1   CC  6   5
25  group1  14652   14654   Marker  CC  3   8   NA  0   0   CC  3   1
26  group1  14670   14672   Marker  GG  10  11  NA  0   0   NA  0   0

Error message:

Traceback (most recent call last):
  File "test_script.py", line 8, in <module>
    data = pd.concat(data1, ignore_index=True)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
    copy=copy)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
    objs = list(objs)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
    return self.get_chunk()
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
    return self.read(nrows=size)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault      (core dumped) python3.6 test_script.py

mir*_*ulo 5

You are not actually processing the data in chunks at all.

With data1 = pd.read_csv('...', chunksize=10000, iterator=True), data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields the CSV data as DataFrame chunks of 10000 rows each.
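
To see this, you can pull a single chunk off the reader before doing anything else (using the same placeholder path as above):

reader = pd.read_csv(path, sep='\t', chunksize=10000, iterator=True)
print(type(reader))         # <class 'pandas.io.parsers.TextFileReader'>
first_chunk = next(reader)  # a DataFrame of up to 10000 rows
print(first_chunk.shape)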

But pd.concat then consumes the entire iterator, so it tries to load the whole CSV into memory anyway, completely defeating the purpose of using chunksize and iterator.

Correct usage of chunksize and iterator

To process the data in chunks, you have to iterate over the actual DataFrame chunks that the read_csv iterator yields:

data1 = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)

for chunk in data1:
    # do my processing of DataFrame chunk of 1000 rows here
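
Applied to your script, that means moving the counting loop inside the chunk loop and accumulating the totals across chunks, instead of concatenating everything first. A sketch along those lines (same column layout and thresholds as your code; the inner loop is kept row-by-row for clarity, though vectorizing it per chunk would be much faster):

import pandas as pd

path = "/path/hugefile.txt"

marker_keep = 0
marker_remove = 0

# Stream the file 1000 rows at a time; only one chunk is in memory at once.
for chunk in pd.read_csv(path, sep='\t', chunksize=1000, iterator=True):
    for i in range(chunk.shape[0]):
        j = 5  # first sample genotype column (starts at the 6th column)
        missing = 0
        NoNmiss = 0
        while j < (chunk.shape[1] - 2):
            if pd.isnull(chunk.iloc[i, j]):
                missing += 1
            elif (chunk.iloc[i, j + 1] >= 10) and ((chunk.iloc[i, j + 1] / chunk.iloc[i, j + 2]) > 0.5):
                NoNmiss += 1
            else:
                missing += 1
            j += 3
        if NoNmiss / (missing + NoNmiss) >= 0.5:
            marker_keep += 1
        else:
            marker_remove += 1

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: {0}; marker remove: {1}".format(marker_keep, marker_remove))

Because each row is scored independently, the keep/remove totals can simply be accumulated across chunks, and peak memory stays bounded by the chunk size rather than the file size.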

Minimal example

Suppose we have a CSV bigdata.txt:

A1, A2
B1, B2
C1, C2
D1, D2
E1, E2

We want to process it 1 row at a time (for whatever reason).

Incorrect usage of chunksize and iterator

df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

df = pd.concat(df_iter)
df
##     0    1
## 0  A1   A2
## 1  B1   B2
## 2  C1   C2
## 3  D1   D2
## 4  E1   E2

We can see that we have loaded the entire CSV into memory, even though chunksize is 1.

Correct usage

df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)

for iter_num, chunk in enumerate(df_iter, 1):
    print('Processing iteration {0}'.format(iter_num))
    print(chunk)

##  Processing iteration 1
##      0    1
##  0  A1   A2
##  Processing iteration 2
##      0    1
##  1  B1   B2
##  Processing iteration 3
##      0    1
##  2  C1   C2
##  Processing iteration 4
##      0    1
##  3  D1   D2
##  Processing iteration 5
##      0    1
##  4  E1   E2
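
If all you need is a summary statistic rather than the full frame, accumulate it as you iterate; for example, a running row count over the same bigdata.txt that never holds more than one chunk in memory:

df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
total_rows = sum(len(chunk) for chunk in df_iter)
print(total_rows)  # 5, computed without ever loading the whole file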