Why is pandas concatenation (pandas.concat) so memory inefficient?

sfo*_*ney 20 python ram numpy pandas

I have about 30 GB of data (in a list of roughly 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds (concatenating in smaller batches with for loops, etc.), but I still cannot get these to concatenate. Two questions spring to mind:

  1. Has anyone else dealt with this problem and found an effective workaround? Because of the "column merging" (for lack of a better word) functionality I need, I cannot use a straight append — I need the join='outer' argument of pd.concat().

  2. Why is pandas concatenation (which I know just calls numpy.concatenate) so inefficient with its use of memory?

I should also note that I do not think the problem is an explosion of columns: concatenating 100 of the dataframes together gives about 3,000 columns, whereas the base dataframes have about 1,000.
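As a quick sanity check on that claim (a minimal sketch; `datalist4` is the list of dataframes from the example below), the width of the outer-joined result can be measured before concatenating anything:

    # Union of all column names across the list of dataframes. If this is
    # much wider than any single frame, an outer join will pad heavily
    # with NaN and the result will balloon.
    all_cols = set()
    for df in datalist4:
        all_cols.update(df.columns)
    print(len(all_cols))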

EDIT:

The data I am working with is financial data, with each of my 900 dataframes about 1,000 columns wide and about 50,000 rows deep. The data types, going left to right, are:

  1. date in string format,
  2. string
  3. np.float
  4. int

...and so on, repeating. I am concatenating on the column names with an outer join, which means that any columns in df2 that are not in df1 are not discarded, but shunted off to the side.
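To make that alignment behaviour concrete, here is a minimal sketch with two toy frames (hypothetical column names, not my real data):

    import pandas as pd

    df1 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
    df2 = pd.DataFrame({'a': [5, 6], 'c': ['x', 'y']})

    # join='outer' keeps the union of the columns; cells a frame never
    # had are filled with NaN, so 'b' and 'c' each pick up NaN rows.
    out = pd.concat([df1, df2], join='outer', axis=0, ignore_index=True)
    print(out)
    #    a    b    c
    # 0  1  3.0  NaN
    # 1  2  4.0  NaN
    # 2  5  NaN    x
    # 3  6  NaN    y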


Example:

 #example code
 data=pd.concat(datalist4, join="outer", axis=0, ignore_index=True)
 #two example dataframes (about 90% of the column names should be in common
 #between the two dataframes, the unnamed columns, etc are not a significant
 #number of the columns)

print datalist4[0].head()
                800_1     800_2   800_3  800_4               900_1     900_2
0 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID
1 2014-08-06 09:00:00  BEST_ASK  1120.0    103 2014-08-06 09:00:00  BEST_ASK   
2 2014-08-06 09:00:00  BEST_BID  1106.9     11 2014-08-06 09:00:00  BEST_BID   
3 2014-08-06 09:00:00  BEST_ASK  1125.8     62 2014-08-06 09:00:00  BEST_ASK   
4 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID   

    900_3  900_4              1000_1    1000_2    ...     2400_4
0  1017.2    103 2014-08-06 09:00:00  BEST_BID    ...        NaN
1  1020.1    103 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
2  1004.3     11 2014-08-06 09:00:00  BEST_BID    ...        NaN   
3  1022.9     11 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
4  1006.7     10 2014-08-06 09:00:00  BEST_BID    ...        NaN   

                      _1  _2  _3  _4                   _1.1 _2.1 _3.1  _4.1
0  #N/A Invalid Security NaN NaN NaN  #N/A Invalid Security  NaN  NaN   NaN
1                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
2                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
3                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
4                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   

      dater  
0  2014.8.6  
1  2014.8.6  
2  2014.8.6  
3  2014.8.6  
4  2014.8.6  

[5 rows x 777 columns]

print datalist4[1].head()
                150_1     150_2   150_3  150_4               200_1     200_2
0 2013-12-04 09:00:00  BEST_BID  1639.6     30 2013-12-04 09:00:00  BEST_ASK
1 2013-12-04 09:00:00  BEST_ASK  1641.8    133 2013-12-04 09:00:08  BEST_BID   
2 2013-12-04 09:00:01  BEST_BID  1639.5     30 2013-12-04 09:00:08  BEST_ASK   
3 2013-12-04 09:00:05  BEST_BID  1639.4     30 2013-12-04 09:00:08  BEST_ASK   
4 2013-12-04 09:00:08  BEST_BID  1639.3    133 2013-12-04 09:00:08  BEST_BID   

    200_3  200_4               250_1     250_2    ...                 2500_1
0  1591.9    133 2013-12-04 09:00:00  BEST_BID    ...    2013-12-04 10:29:41
1  1589.4     30 2013-12-04 09:00:00  BEST_ASK    ...    2013-12-04 11:59:22   
2  1591.6    103 2013-12-04 09:00:01  BEST_BID    ...    2013-12-04 11:59:23   
3  1591.6    133 2013-12-04 09:00:04  BEST_BID    ...    2013-12-04 11:59:26   
4  1589.4    133 2013-12-04 09:00:07  BEST_BID    ...    2013-12-04 11:59:29   

     2500_2 2500_3 2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK   0.35     50  #N/A Invalid Security             NaN
1  BEST_ASK   0.35     11                    NaN             NaN   
2  BEST_ASK   0.40     11                    NaN             NaN   
3  BEST_ASK   0.45     11                    NaN             NaN   
4  BEST_ASK   0.50     21                    NaN             NaN   

  Unnamed: 844_3 Unnamed: 844_4         Unnamed: 848_1      dater  
0            NaN            NaN  #N/A Invalid Security  2013.12.4  
1            NaN            NaN                    NaN  2013.12.4  
2            NaN            NaN                    NaN  2013.12.4  
3            NaN            NaN                    NaN  2013.12.4  
4            NaN            NaN                    NaN  2013.12.4  

[5 rows x 850 columns]

Ale*_*der 15

I have had performance issues concatenating a large number of DataFrames to a "growing" DataFrame. My workaround was to append all of the sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames has been completed.
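A minimal sketch of that pattern (the `file_paths` list and `load_frame` reader are hypothetical placeholders):

    import pandas as pd

    frames = []
    for path in file_paths:              # hypothetical list of input files
        frames.append(load_frame(path))  # hypothetical per-file reader

    # One concat at the end does a single allocate-and-copy, instead of
    # re-copying the entire accumulated result on every iteration the way
    # a "growing" DataFrame does.
    result = pd.concat(frames, ignore_index=True)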

  • This is actually my current workaround. It seems to work fine, but I wonder whether there is a better way. Thanks! (3 upvotes)
  • This idea does not work for me. I have nearly 6 million rows, split into 1,000-row chunks following http://deo.im/2016/09/22/Load-data-from-mongodb-to-Pandas-DataFrame/. The loading works well, but when I reach the concatenation step it just locks up my computer. Any ideas? (2 upvotes)
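One commonly suggested alternative when even the single final concat does not fit in memory is to accumulate the chunks on disk instead of in RAM. A minimal sketch, assuming PyTables is installed and the chunks share a stable schema ('combined.h5' and the `chunks` iterable are placeholders):

    import pandas as pd

    # Append each chunk to an on-disk HDF5 table; only one chunk is
    # resident in memory at a time, and subsets can later be read back
    # with pd.read_hdf instead of materialising the whole result.
    store = pd.HDFStore('combined.h5')
    for chunk in chunks:
        store.append('data', chunk)
    store.close()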