I am trying to read a large CSV file (approx. 6 GB) in pandas and I get the following memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
450 infer_datetime_format=infer_datetime_format)
451
--> 452 return _read(filepath_or_buffer, kwds)
453
454 parser_f.__name__ …

I have a very large dataset and I cannot read the whole thing into memory. So I would like to read only a portion of it for training, but I do not know how to do that. Any ideas would be appreciated.
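A sketch of the usual workarounds (assuming the file can be processed piece by piece; the chunk sizes and 'some_column' below are made up for illustration): read the CSV in chunks and keep only a reduced result per chunk, or cap the number of rows with nrows if only a sample is needed for training.

import pandas as pd

# Read in manageable chunks and keep only a reduced result per chunk
pieces = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=100000):
    pieces.append(chunk[chunk['some_column'] > 0])  # 'some_column' is hypothetical
data = pd.concat(pieces, ignore_index=True)

# Or read just the first part of the file for training
sample = pd.read_csv('aphro.csv', sep=';', nrows=1000000)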
It seems I can memory-map the underlying data of a pandas Series by creating a mmap'd ndarray and using it to initialize the Series.
import numpy as np
import pandas as pd

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)

# Original ndarray
n = 1000
_arr = np.arange(0, 1000, dtype=float)

# Convert it to a memmap
filename = "mm.dat"  # any writable path
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

# Wrap as a series
s = pd.Series(mm, name="a")
assert_readonly(s.iloc)
Success! It seems that s is backed by the read-only memory-mapped ndarray. Can I do the same for a DataFrame? The following fails:
df = pd.DataFrame(s, copy=False, columns=['a'])
assert_readonly(df["a"])  # Fails …
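A sketch to experiment with, not a guaranteed zero-copy recipe (the 2-D memmap, the file name "columns.dat" and the column names are made up): construct the DataFrame directly from a two-dimensional memmap and then check whether pandas kept the mapped buffer or silently copied it.

import numpy as np
import pandas as pd

nrows, ncols = 1000, 2
mm2 = np.memmap("columns.dat", mode='w+', shape=(nrows, ncols), dtype=float)
mm2[:] = np.arange(nrows * ncols, dtype=float).reshape(nrows, ncols)
mm2.flush()

df = pd.DataFrame(mm2, columns=['a', 'b'], copy=False)
print(np.shares_memory(df.values, mm2))  # True only if no copy was made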
I am trying to read and process 1000 files, but unfortunately processing a file takes about three times as long as reading it from disk, so I would like to process the files as they are read in (while I continue reading the remaining files).

In a perfect world, I have a generator that reads one file at a time, and I would like to pass this generator to a pool of workers that process items from the generator as they are (slowly) produced.
Here is an example:
import os
from multiprocessing import Pool

def process_file(file_string):
    ...
    return processed_file

pool = Pool(processes=4)
path = 'some/path/'
results = pool.map(process_file, (open(path+part, 'rb').read() for part in os.listdir(path)))
The only problem with the code above is that all of the files are read into memory before the pool even starts, which means I have to wait for the disk to read everything, and it also consumes a large amount of memory.
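One common workaround, sketched here under the assumption that each worker can open the files itself and reusing process_file from the snippet above: send only the file paths to the pool, so the parent process never reads everything up front and reads overlap with processing across the workers. imap_unordered hands results back as soon as each worker finishes one.

import os
from multiprocessing import Pool

def process_path(file_path):
    # Read inside the worker instead of the parent process
    with open(file_path, 'rb') as f:
        file_string = f.read()
    return process_file(file_string)

if __name__ == '__main__':
    path = 'some/path/'
    file_paths = [os.path.join(path, part) for part in os.listdir(path)]
    pool = Pool(processes=4)
    try:
        for result in pool.imap_unordered(process_path, file_paths):
            pass  # consume each result as soon as a worker finishes it
    finally:
        pool.close()
        pool.join()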
I am multiprocessing a pandas DataFrame by splitting it into multiple DataFrames, which are stored as a list, and passing each one to a defined function with Pool.map(). My input file is about 300 MB, so each small DataFrame is roughly 75 MB. However, when the multiprocessing runs, memory consumption grows by 7 GB, with each local process consuming roughly 1–2 GB of memory. Why does this happen?
import pandas as pd
from multiprocessing import Pool

def main():
    my_df = pd.read_table("my_file.txt", sep="\t")
    my_df = my_df.groupby('someCol')
    my_df_list = []
    for colID, colData in my_df:
        my_df_list.append(colData)

    # now, multiprocess each small dataframe individually
    p = Pool(3)
    result = p.map(process_df, my_df_list)
    p.close()
    p.join()
    print('Global maximum memory usage: %.2f (mb)' % current_mem_usage())

    result_merged = pd.concat(result)
    # write merged data to file

def process_df(my_df):
    my_new_df = ...  # do something with "my_df"
    print('\tWorker maximum memory usage: %.2f (mb)' % current_mem_usage())
    return my_new_df
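A likely contributor to the growth: Pool.map pickles every chunk to a worker process and pickles the result back, so the original frame, the list of per-group copies, the workers' copies, and the returned results can all be alive at the same time. Below is a sketch, not a definitive fix (process_df is replaced by a trivial placeholder so the snippet runs on its own; current_mem_usage is omitted): it drops references as early as possible and consumes results one by one with imap, which may lower the peak, especially if each result can be written out instead of accumulated.

from multiprocessing import Pool
import pandas as pd

def process_df(my_df):
    # Placeholder for the real per-chunk processing from the question
    return my_df.describe()

def main():
    my_df = pd.read_table("my_file.txt", sep="\t")
    my_df_list = [colData for _, colData in my_df.groupby('someCol')]
    del my_df  # the per-group frames carry the data the workers need

    results = []
    p = Pool(3)
    try:
        for out in p.imap(process_df, my_df_list):  # results arrive one by one
            results.append(out)
    finally:
        p.close()
        p.join()

    result_merged = pd.concat(results)
    return result_merged

if __name__ == '__main__':
    main()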