I have the following code to analyze a huge dataframe file (22 GB, more than 2 million rows and 3K columns). I tested the code on a smaller dataframe and it ran fine (head -1000 hugefile.txt). However, when I run it on the huge dataframe, it gives me a "Segmentation fault" core dump and writes out a core.number binary file.
I did some searching online and came up with using low_memory=False, and also tried reading the dataframe with chunksize=1000, iterator=True and then pandas.concat of the chunks, but that still gives me memory problems (core dump). It does not even finish reading the whole file before dumping core; I verified this by testing a script that only reads the file and prints some text. Please help and tell me whether there is a solution for analyzing this huge file.
Versions
Python version: 3.6.2
numpy version: 1.13.1
pandas version: 0.20.3
OS: Linux/Unix
Script
#!/usr/bin/python
import pandas as pd
import numpy as np
path = "/path/hugefile.txt"
data1 = pd.read_csv(path, sep='\t', low_memory=False,chunksize=1000, iterator=True)
data = pd.concat(data1, ignore_index=True)
#######
i=0
marker_keep = 0
marker_remove = 0
while(i<(data.shape[0])):
    j=5 #starts at 6
    missing = 0
    NoNmiss = 0
    while (j < (data.shape[1]-2)):
        if pd.isnull(data.iloc[i,j]) == True:
            missing = missing +1
            j= j+3
        elif ((data.iloc[i,j+1] >=10) & (((data.iloc[i,j+1])/(data.iloc[i,j+2])) > 0.5)):
            NoNmiss = NoNmiss +1
            j=j+3
        else:
            missing = missing +1
            j= j+3
    if (NoNmiss/(missing+NoNmiss)) >= 0.5:
        marker_keep = marker_keep + 1
    else:
        marker_remove = marker_remove +1
    i=i+1
a = str(marker_keep)
b= str(marker_remove)
c = "marker keep: " + a + "; marker remove: " +b
result = open('PyCount_marker_result.txt', 'w')
result.write(c)
result.close()
Sample dataset:
Index Group Number1 Number2 DummyCol sample1.NA sample1.NA.score sample1.NA.coverage sample2.NA sample2.NA.score sample2.NA.coverage sample3.NA sample3.NA.score sample3.NA.coverage
1 group1 13247 13249 Marker CC 3 1 NA 0 0 NA 0 0
2 group1 13272 13274 Marker GG 7 6 GG 3 1 GG 3 1
4 group1 13301 13303 Marker CC 11 12 CC 5 4 CC 5 3
5 group1 13379 13381 Marker CC 6 5 CC 5 4 CC 5 3
7 group1 13417 13419 Marker GG 7 6 GG 4 2 GG 5 4
8 group1 13457 13459 Marker CC 13 15 CC 9 9 CC 11 13
9 group1 13493 13495 Marker AA 17 21 AA 11 12 AA 11 13
10 group1 13503 13505 Marker GG 14 17 GG 9 10 GG 13 15
11 group1 13549 13551 Marker GG 6 5 GG 4 2 GG 6 5
12 group1 13648 13650 Marker NA 0 0 NA 0 0 NA 0 0
13 group1 13759 13761 Marker NA 0 0 NA 0 0 NA 0 0
14 group1 13867 13869 Marker NA 0 0 NA 0 0 NA 0 0
15 group1 13895 13897 Marker CC 3 1 NA 0 0 NA 0 0
20 group1 14430 14432 Marker GG 15 18 NA 0 0 GG 5 3
21 group1 14435 14437 Marker GG 16 20 GG 3 1 GG 4 2
22 group1 14463 14465 Marker AT 0 24 AA 3 1 TT 4 6
23 group1 14468 14470 Marker CC 18 23 CC 3 1 CC 6 5
25 group1 14652 14654 Marker CC 3 8 NA 0 0 CC 3 1
26 group1 14670 14672 Marker GG 10 11 NA 0 0 NA 0 0
Error message:
Traceback (most recent call last):
File "test_script.py", line 8, in <module>
data = pd.concat(data1, ignore_index=True)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 206, in concat
copy=copy)
File "/home/user/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 236, in __init__
objs = list(objs)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 978, in __next__
return self.get_chunk()
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1042, in get_chunk
return self.read(nrows=size)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10885)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
/opt/gridengine/default/Federation/spool/execd/kcompute030/job_scripts/5883517: line 10: 29990 Segmentation fault (core dumped) python3.6 test_script.py
You are not processing the data in chunks at all.
With data1 = pd.read_csv('...', chunksize=10000, iterator=True),
data1 becomes a pandas.io.parsers.TextFileReader, an iterator that yields the CSV data in 10000-row chunks as DataFrames.
But pd.concat then consumes the entire iterator, and so tries to load the whole CSV into memory, which completely defeats the purpose of using chunksize and iterator.
Correct use of chunksize and iterator
To process the data in chunks, you have to iterate over the actual DataFrame chunks that the iterator returned by read_csv yields.
data1 = pd.read_csv(path, sep='\t', chunksize=1000, iterator=True)
for chunk in data1:
    # do my processing of the 1000-row DataFrame chunk here
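Applied to your script, that means moving the marker-counting loops inside the chunk loop and keeping running totals across chunks, so only 1000 rows sit in memory at any time. A rough sketch that keeps your per-row logic and column layout (path, column offsets and output file name copied from your script; untested on your data):

import pandas as pd

path = "/path/hugefile.txt"   # same path as in the question

marker_keep = 0
marker_remove = 0

# Only one 1000-row chunk is held in memory at a time; the two counters
# carry the result across chunks.
for chunk in pd.read_csv(path, sep='\t', chunksize=1000, iterator=True):
    for i in range(chunk.shape[0]):
        j = 5  # first sample genotype column, as in the original script
        missing = 0
        NoNmiss = 0
        while j < (chunk.shape[1] - 2):
            if pd.isnull(chunk.iloc[i, j]):
                missing += 1
            elif (chunk.iloc[i, j+1] >= 10) and ((chunk.iloc[i, j+1] / chunk.iloc[i, j+2]) > 0.5):
                NoNmiss += 1
            else:
                missing += 1
            j += 3
        if NoNmiss / (missing + NoNmiss) >= 0.5:
            marker_keep += 1
        else:
            marker_remove += 1

with open('PyCount_marker_result.txt', 'w') as result:
    result.write("marker keep: {0}; marker remove: {1}".format(marker_keep, marker_remove))

Row-by-row iloc loops are slow in pandas, but the point here is that memory use stays bounded by the chunk size instead of the file size.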
Suppose we have a CSV bigdata.txt:
A1, A2
B1, B2
C1, C2
D1, D2
E1, E2
and we want to process it 1 row at a time (for whatever reason).
Incorrect usage of chunksize and iterator
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
df = pd.concat(df_iter)
df
## 0 1
## 0 A1 A2
## 1 B1 B2
## 2 C1 C2
## 3 D1 D2
## 4 E1 E2
We can see that the entire CSV has been loaded into memory, despite a chunksize of 1.
Correct usage
df_iter = pd.read_csv('bigdata.txt', chunksize=1, iterator=True, header=None)
for iter_num, chunk in enumerate(df_iter, 1):
    print('Processing iteration {0}'.format(iter_num))
    print(chunk)
## Processing iteration 1
## 0 1
## 0 A1 A2
## Processing iteration 2
## 0 1
## 1 B1 B2
## Processing iteration 3
## 0 1
## 2 C1 C2
## Processing iteration 4
## 0 1
## 3 D1 D2
## Processing iteration 5
## 0 1
## 4 E1 E2
Run Code Online (Sandbox Code Playgroud)