Efficient way to combine pandas dataframes row-wise

sed*_*deh 5 python numpy pandas

I have 14 dataframes, each with 14 columns and more than 250,000 rows. The dataframes have identical column headers, and I want to merge them row-wise. I tried concatenating them onto a "growing" DataFrame, and that takes hours.

Essentially, I did the following 13 times:

DF = pd.DataFrame()
for i in range(13):
    subDF = ...  # build/load the next dataframe (not shown)
    DF = pd.concat([DF, subDF])

An answer here on Stack Overflow suggests appending all the sub-dataframes to a list and then concatenating that list.

Something like this:

DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])

Aren't those the same thing? Perhaps I misunderstood the suggested workflow. Here is what I tested:

import numpy
import pandas as pd
import timeit

def test1():
    """Make all subDFs first, then concatenate them"""
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    """Add each subDF to the collecting DF as it is created"""
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))

Output:

test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec

Any suggestions for an efficient way to concatenate multiple large dataframes row-wise would be much appreciated. Thanks!

Alb*_*oso 8

Create a list containing all of your dataframes:

dfs = []
for i in range(13):
    df = ... # However it is that you create your dataframes   
    dfs.append(df)

Then concatenate them in a single call:

merged = pd.concat(dfs) # add ignore_index=True if appropriate
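As a concrete, runnable sketch (three small stand-in frames here instead of your 14 large ones):

```python
import numpy as np
import pandas as pd

# Three small stand-in frames (the real case has 14 frames of 250,000+ rows)
dfs = [pd.DataFrame(np.random.rand(5, 3)) for _ in range(3)]

# One concat call; ignore_index=True renumbers the rows 0..14
merged = pd.concat(dfs, ignore_index=True)
print(merged.shape)  # (15, 3)
```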

This is much faster than your code, because it creates exactly 14 dataframes (your original 13 plus merged), whereas your code creates 26 (your original 13 plus 13 intermediate merges).
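Another way to see it is the amount of data copied. Each sequential concat copies all rows accumulated so far plus the new ones, so the total copying grows quadratically in the number of frames; batch concat copies each row once. A back-of-the-envelope sketch (illustrative arithmetic, not pandas internals; assumes 13 equal frames of r rows and that concat copies both inputs):

```python
# Row-copy counts for sequential vs. batch concatenation
n, r = 13, 250_000

# Sequential: step i copies the i*r rows accumulated so far plus the new r rows
sequential_copies = sum(i * r + r for i in range(n))

# Batch: every row is copied exactly once into the final result
batch_copies = n * r

print(sequential_copies, batch_copies)  # 22750000 3250000
```

Under these assumptions the sequential approach copies roughly 7x as many rows, and the ratio keeps growing with the number of frames.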

Edit:

Here is a variant of your test code.

import numpy
import pandas as pd
import timeit

def test_gen_time():
    """Create three large dataframes, but don't concatenate them"""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one"""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end"""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
          .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
          .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
          .format(timeit.timeit(test_batch_concat, number=200)))

Output:

test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec

The largest share of that time goes to generating the dataframes. Batch concatenation takes about 2.9 seconds; sequential concatenation takes more than 7 seconds.
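To isolate the concatenation cost itself, one can also build the frames once up front and time only the concat step (a variant sketch; absolute timings will vary by machine):

```python
import timeit

import numpy
import pandas as pd

# Build the frames once so only concatenation is timed
frames = [pd.DataFrame(numpy.random.rand(10**5)) for _ in range(3)]

def sequential():
    """Concatenate one frame at a time onto a growing DataFrame"""
    DF = pd.DataFrame()
    for df in frames:
        DF = pd.concat([DF, df], ignore_index=True)
    return DF

def batch():
    """Concatenate the whole list in a single call"""
    return pd.concat(frames, ignore_index=True)

# Both approaches produce identical results
assert sequential().equals(batch())

print('sequential:', timeit.timeit(sequential, number=20))
print('batch:     ', timeit.timeit(batch, number=20))
```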