Loading a pandas DataFrame with a chunksize determined by a column variable

Question

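If my CSV file is too large to load into memory with pandas (35 GB in this case), I know I can process the file in chunks using chunksize.

But I would like to know whether the chunk size can be varied based on the values in a column.

I have an ID column, and each ID has several rows of information, like this: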

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
etc...

I don't want an ID to be split across different chunks. For example, chunks of size 4 would be processed as:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <--this extra line is included in the 4 chunk

ID,   Time,  x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...

Is this possible?

If so, ideally without falling back on the csv library and a for loop along these lines:

for line in file:
    x += 1
    if x > 1000000 and curid != line[0]:
        break
    curid = line[0]
    #code to append line to a dataframe

I know, though, that this would only create one chunk, and that the for loop would take a long time to process.

Answer 1

elc*_*ato 5

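If you iterate over the CSV file row by row, you can use a generator with yield to emit chunks whose boundaries depend on any column.

Working example: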

import pandas as pd

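def iter_chunk_by_id(file):
    # Read one row at a time; each item from the reader is a one-row DataFrame.
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    current_id = first_chunk.iloc[0, 0]
    rows = [first_chunk]
    for row in csv_reader:
        row_id = row.iloc[0, 0]
        if row_id != current_id:
            # The ID changed: emit the rows collected so far and start a new group.
            yield pd.concat(rows)
            rows = []
            current_id = row_id
        rows.append(row)
    # Emit the final group. pd.concat replaces DataFrame.append, removed in pandas 2.0.
    yield pd.concat(rows)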

## data.csv ##
# 1, foo, bla
# 1, off, aff
# 2, roo, laa
# 2, jkl, xds
# 3, asd, fds
# 3, qwe, tre
# 3, tre, yxc

chunk_iter = iter_chunk_by_id("data.csv")

for chunk in chunk_iter:
    print(chunk)
    print("_____")

Output:

   0     1     2
0  1   foo   bla
1  1   off   aff
_____
   0     1     2
2  2   roo   laa
3  2   jkl   xds
_____
   0     1     2
4  3   asd   fds
5  3   qwe   tre
6  3   tre   yxc
_____
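
Reading one row at a time is likely to be slow on a 35 GB file. As a rough sketch of an alternative (not part of the answer above), one could read fixed-size chunks and hold back the rows of the last ID in each chunk until the next chunk arrives, so that no ID is split. This assumes rows with the same ID are contiguous in the file; the function name iter_chunks_keep_ids_together and the in-memory sample data are made up for illustration. Unlike the layout sketched in the question, a chunk here is cut short before an unfinished ID rather than extended past the requested size.

```python
import io

import pandas as pd


def iter_chunks_keep_ids_together(source, id_col, chunksize, **read_csv_kwargs):
    """Yield DataFrames of roughly `chunksize` rows without splitting an ID.

    Assumption: all rows sharing an ID are adjacent in the file.
    """
    carry = None  # rows of the trailing ID held back from the previous chunk
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        if carry is not None:
            chunk = pd.concat([carry, chunk])
        # The last ID in this chunk may continue in the next one, so hold it back.
        last_id = chunk[id_col].iloc[-1]
        is_last = chunk[id_col] == last_id
        complete, carry = chunk[~is_last], chunk[is_last]
        if not complete.empty:
            yield complete
    # Whatever is still held back at end of file forms the final chunk.
    if carry is not None and not carry.empty:
        yield carry


# Hypothetical sample mirroring the data in the question.
sample_csv = """ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
"""

chunks = list(
    iter_chunks_keep_ids_together(
        io.StringIO(sample_csv), id_col="ID", chunksize=4, skipinitialspace=True
    )
)
for chunk in chunks:
    print(chunk)
    print("_____")
```

For a real file, the path would replace the io.StringIO object and chunksize would be something like 1000000. Chunk sizes vary around that number depending on where the ID boundaries fall, and a single ID with more rows than chunksize is still kept together in one chunk.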
