Reading multiple CSV files in chunks with Pandas

Question

pyt*_*nja 5 python pandas sklearn-pandas jupyter-notebook

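How can I import and read multiple CSV files in chunks when the combined size of all the files is about 20 GB?

I don't want to use Spark, because I want to use the models in SkLearn, so I'm looking for a solution in Pandas itself.

My code is: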

import glob
import os
import pandas as pd
allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)


But this fails, because the total size of all the CSVs in my path is 17 GB.

I would like to read them in chunks, but I get an error if I try it like this:

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",",chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)


The error I get is:

"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"

Can anyone help?

Answer 1

Fre*_*chy 0

To read large CSV files you can use chunksize, but in that case you have to use an iterator, like this:

for df in pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000):
    process(df)


You then have to concatenate or append each chunk.

Or you can do it like this:

# With chunksize set, read_csv returns a reader that yields DataFrames
reader = pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000)
for chunk in reader:
    process(chunk)

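As a minimal sketch of the "concatenate each chunk" step for a single file: the chunks are collected in a list and concatenated once at the end. The function name `read_in_chunks` and the optional `keep` callback are hypothetical and only serve the illustration; `keep` lets each chunk be filtered or trimmed before it is retained, which is where any memory saving would come from.

```python
import pandas as pd


def read_in_chunks(path, chunksize=10000, keep=None):
    """Read one CSV chunk by chunk and return a single DataFrame.

    `keep` is a hypothetical optional callable that takes a chunk and
    returns the (usually smaller) DataFrame to retain.
    """
    pieces = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # rather than a DataFrame, so each chunk is handled in turn.
    for chunk in pd.read_csv(path, sep=",", chunksize=chunksize):
        pieces.append(keep(chunk) if keep is not None else chunk)
    # Concatenate once at the end instead of growing a frame in the loop.
    return pd.concat(pieces, ignore_index=True)
```

Without a `keep` function the result is as large as the whole file, so chunking by itself only helps if each chunk is reduced before it is stored.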

To read multiple files, for example:

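import pandas as pd

listfile = ['file1', 'file2']
chunks = []

def process(d):
    # other per-chunk coding goes here
    chunks.append(d)

for f in listfile:
    for df in pd.read_csv(f, sep=',', iterator=True, chunksize=10000):
        process(df)

# pd.concat expects a list of frames (DataFrame.append was removed in pandas 2.0)
dfx = pd.concat(chunks, ignore_index=True)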

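The attempt in the question fails because `pd.read_csv(..., chunksize=...)` returns a reader object per file, and `pd.concat` was handed those readers instead of DataFrames. A possible sketch that keeps the `glob` approach from the question flattens the readers into one stream of chunks first. The names `iter_chunks` and `load_folder` are made up for this example.

```python
import glob
import os

import pandas as pd


def iter_chunks(folder, pattern="*.csv", chunksize=10000):
    """Yield DataFrame chunks from every matching CSV in `folder`."""
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # Each reader is itself an iterator of DataFrames; yield from
        # flattens it so callers only ever see DataFrames.
        yield from pd.read_csv(path, sep=",", chunksize=chunksize)


def load_folder(folder, chunksize=10000):
    """Concatenate all chunks; this still needs RAM for the full result."""
    return pd.concat(iter_chunks(folder, chunksize=chunksize), ignore_index=True)
```

`load_folder` removes the error but not the memory problem. Looping over `iter_chunks(path)` directly and reducing each chunk is the variant that keeps memory bounded.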

When you have a large number of files, you can use Dask, or Pool from the multiprocessing library, to run many read processes in parallel.
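
A rough sketch of the pool idea follows. It uses `multiprocessing.pool.ThreadPool`, which has the same `map` interface as `multiprocessing.Pool` but runs threads instead of processes, so the example does not need an `if __name__ == "__main__":` guard; whether threads or processes are faster depends on the data and is not measured here. The names `read_one` and `read_parallel` and the default of 4 workers are assumptions for the illustration.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd


def read_one(path, chunksize=10000):
    """Read a single CSV in chunks and return it as one DataFrame."""
    return pd.concat(pd.read_csv(path, sep=",", chunksize=chunksize),
                     ignore_index=True)


def read_parallel(paths, workers=4):
    """Read several CSVs concurrently; `workers` is an assumed default."""
    with ThreadPool(workers) as pool:
        # map preserves the order of `paths` in the returned list.
        frames = pool.map(read_one, paths)
    return pd.concat(frames, ignore_index=True)
```

Parallel reading shortens wall-clock time at best. All files still end up in memory at once, so it does not help if the combined data does not fit in RAM.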

Either way, either you have enough memory or you lose time.
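
To illustrate the "lose time" side of that trade-off: if the whole dataset is not needed in memory at once, a statistic can be accumulated chunk by chunk so that only one chunk is held at a time. The sketch below computes the mean of one column across several files. The function name `column_mean` and its parameters are hypothetical.

```python
import pandas as pd


def column_mean(paths, column, chunksize=10000):
    """Mean of `column` across many CSVs, holding one chunk at a time."""
    total = 0.0
    count = 0
    for path in paths:
        # usecols limits parsing to the one column that is needed.
        for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
            values = chunk[column].dropna()
            total += values.sum()
            count += len(values)
    return total / count if count else float("nan")
```

The same pattern works for any computation that can be updated incrementally; anything that needs all rows at once still requires enough memory for the full frame.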

Archived:	6 years, 11 months ago
Views:	4219
Last recorded:	5 years, 10 months ago