Faster way to split a large CSV file evenly by groups into smaller CSV files?

Gre*_*dot 5 python csv awk file pandas

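I'm sure there is a better way to do this, but I'm drawing a blank. I have a CSV file in the format below. It is sorted by the ID column, so rows with the same ID are at least grouped together: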

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text3, CCCC
this is sample text4, DDDD
this is sample text4, DDDD
this is sample text5, EEEE
this is sample text5, EEEE
this is sample text6, FFFF
this is sample text6, FFFF

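What I want to do is quickly split the CSV into X smaller CSV files. So if X == 3, AAAA would go into "1.csv", BBBB into "2.csv", CCCC into "3.csv", and the next group would loop back around and go into "1.csv".

The groups vary in size, so hard-coding a split by row count won't work here.

Is there a faster way to split them reliably than my current method, which just uses a Pandas groupby in Python to write them out?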

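    file_ = 0
    num_files = 3

    # sort=False keeps the groups in the order they appear in the file
    for name, group in df.groupby(by=['ID'], sort=False):
        file_ += 1
        group['File Num'] = file_
        # file_ is an int, so convert it before building the file name
        group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')
        # wrap around after the last file
        if file_ == num_files:
            file_ = 0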

This is a Python-based solution, but I am open to using awk or bash if it gets the job done.

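EDIT:

To clarify, I want the groups distributed across a fixed number of files that I can set.

In this case, 3 (so X = 3). The first group (AAAA) goes into 1.csv, the second into 2.csv, the third into 3.csv, and then the fourth group loops back around and goes into 1.csv, and so on.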
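
The cycling rule amounts to taking each group's position modulo the number of files. A minimal sketch of that arithmetic, where `file_number` is a hypothetical helper used only for illustration:

```python
def file_number(group_index: int, num_files: int) -> int:
    """Map a 0-based group index to a 1-based file number, cycling round-robin."""
    return (group_index % num_files) + 1


# Groups in the order they appear in the sorted input
groups = ["AAAA", "BBBB", "CCCC", "DDDD", "EEEE", "FFFF"]
assignment = {g: f"{file_number(i, 3)}.csv" for i, g in enumerate(groups)}
print(assignment)
# {'AAAA': '1.csv', 'BBBB': '2.csv', 'CCCC': '3.csv', 'DDDD': '1.csv', 'EEEE': '2.csv', 'FFFF': '3.csv'}
```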

Sample output 1.csv:

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

Sample output 2.csv:

Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

Sample output 3.csv:

Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF

Ed *_*ton 4

Using any awk in any shell on every Unix box:

$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print > out }

$ awk -v X=3 -f tst.awk input.csv

$ head [0-9]*.csv
==> 1.csv <==
Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

==> 2.csv <==
Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

==> 3.csv <==
Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF
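
For anyone who would rather stay in Python, the same single-pass, round-robin idea can be sketched without Pandas. This is only an illustration under a few assumptions: `split_round_robin` and `last_field` are hypothetical names, the group key is the last whitespace-separated field (mirroring awk's `$NF`), and the header is written to all X files up front, so files are created even when there are fewer groups than files. No timing claims are made for it.

```python
import os
from contextlib import ExitStack
from itertools import groupby


def last_field(line):
    """Return the last whitespace-separated field, like awk's $NF ('' for blank lines)."""
    fields = line.split()
    return fields[-1] if fields else ""


def split_round_robin(src_path, num_files, out_dir="."):
    """Stream src_path once, sending each run of equal keys to the next file in turn."""
    with ExitStack() as stack:
        src = stack.enter_context(open(src_path, newline=""))
        header = src.readline()
        outs = []
        for n in range(1, num_files + 1):
            path = os.path.join(out_dir, f"{n}.csv")
            out = stack.enter_context(open(path, "w", newline=""))
            out.write(header)  # every output file starts with the header line
            outs.append(out)
        # groupby yields consecutive runs of lines sharing a key, which is
        # enough here because the input is already sorted by ID
        for i, (_, lines) in enumerate(groupby(src, key=last_field)):
            outs[i % num_files].writelines(lines)
```

Calling `split_round_robin("input.csv", 3)` on the sample input should produce the same three files as the awk script above. Like the first awk script, it keeps all X output files open at once.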

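If X is large enough that you exceed your system's limit on concurrently open files and start getting a "too many open files" error, then you'd need to either use GNU awk, which handles that internally, or change the code to have only 1 file open at a time: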

NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print >> out }

or implement your own way of managing how many files are open concurrently.
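
The one-file-open-at-a-time idea carries over to Python as well. The sketch below is an assumption-laden illustration, not a tuned implementation: `split_one_handle` and `last_field` are hypothetical names, and the key is again the last whitespace-separated field. Each group opens its target file, writes, and closes it, so at most one output handle exists at any moment, at the cost of one open/close per group.

```python
import os
from itertools import groupby


def last_field(line):
    """Return the last whitespace-separated field, like awk's $NF ('' for blank lines)."""
    fields = line.split()
    return fields[-1] if fields else ""


def split_one_handle(src_path, num_files, out_dir="."):
    """Round-robin split that never holds more than one output file open."""
    with open(src_path, newline="") as src:
        header = src.readline()
        for i, (_, lines) in enumerate(groupby(src, key=last_field)):
            path = os.path.join(out_dir, f"{i % num_files + 1}.csv")
            # The first num_files groups create their file and write the header;
            # every later group appends to a file that already exists.
            first_visit = i < num_files
            with open(path, "w" if first_visit else "a", newline="") as out:
                if first_visit:
                    out.write(header)
                out.writelines(lines)
```

As with the awk version that calls `close(out)`, only files that actually receive a group are created.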


EDIT: following the suggestion by @PaulHodges in the comments would result in a script like this:

NR == 1 {
    for ( i=1; i <= X; i++ ) {
        print > (i ".csv")
    }
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    prev = $NF
}
{ print > out }