Pandas: efficiently reading a large panel CSV in chunks based on a column's values

Question

I have a large CSV file (about 50 GB on disk) that I cannot read into memory all at once. The dataset is panel data and looks like this:

ID Time     Col 1 ... Col N
1  2000/1/1 ...
1  2000/1/2
...
2  2000/1/1 ...
...

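My plan for loading this data is to read it in chunks, do some preprocessing to reduce its size, and then save each chunk separately. I know that pd.read_csv(..., chunksize=1000) lets me loop over chunks of 1000 rows, but for the preprocessing to be accurate I would rather loop over chunks defined by the ID column. (Accurate preprocessing needs all rows for a given ID.)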

In other words, suppose I have a smaller file that contains all the values of ID (for example, 1-1000). Then I would like to do something like this:

list_of_id_chunks = [ [1,2,3], [4,5,6], [7,8,9], ... ] # Split the total IDs into chunks of 3 IDs each

for chunk_of_ids in list_of_id_chunks:
    # 1. Read the large csv file with only the rows where `ID` is in chunk_of_ids
    # (For the first iteration, this should have rows with ID = 1, 2, or 3)
    # 2. Do some preprocessing to trim file size
    # 3. Save files in csv, feather, etc

Any suggestions?

Answer 1

BeR*_*2me 5

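You could start with something like this: it reads the file one million rows at a time, splits each chunk by ID, and appends each ID's rows to a file of its own. At the end, every ID has a separate file.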

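import pandas as pd

# chunksize is a whole number of rows; the context manager form needs pandas >= 1.2
with pd.read_csv('big_file.csv', chunksize=1_000_000) as reader:
    for chunk in reader:
        for name, group in chunk.groupby('ID'):
            # Append to this ID's file; header=False means the files carry no header row
            group.to_csv(f'big_file_id_{name}.csv', mode='a', index=False, header=False)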

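One possible refinement, offered as a sketch and not as part of the original answer: write the header row the first time each ID's file is created, so the per-ID files can later be read back with column names. The function name split_csv_by_id, the out_dir argument, and the id_{name}.csv file naming are hypothetical choices made for illustration.

```python
import os

import pandas as pd


def split_csv_by_id(src, out_dir, chunksize=1_000_000, id_col="ID"):
    """Split a large CSV into one file per ID, writing each header once.

    Hypothetical helper: the name, arguments, and output naming are
    assumptions for illustration only.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # IDs whose output file was already started in this run
    with pd.read_csv(src, chunksize=chunksize) as reader:
        for chunk in reader:
            for name, group in chunk.groupby(id_col):
                path = os.path.join(out_dir, f"id_{name}.csv")
                first_write = name not in seen
                group.to_csv(
                    path,
                    # Overwrite on the first write so a rerun does not
                    # append duplicate rows; append afterwards.
                    mode="w" if first_write else "a",
                    index=False,
                    header=first_write,
                )
                seen.add(name)
    return seen
```

With the per-ID files in place, each entry of list_of_id_chunks from the question can be loaded by reading only the files for those IDs and combining them with pd.concat, so every batch holds all rows for its IDs. Note that a dataset with very many distinct IDs produces equally many small files.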
