Splitting a large pandas DataFrame

Question

I have a large DataFrame with 423244 rows. I want to split it into 4. I tried the code below, and it raised an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this DataFrame into 4 groups?

Answer 1

roo*_*oot 123

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]

@NilaniAlgiriyage - `array_split` returns a list of DataFrames, so you can just iterate over the list... (4 upvotes)
This answer is outdated: AttributeError: 'DataFrame' object has no attribute 'size'. (2 upvotes)
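
For readers who run into version-specific problems with np.array_split on a DataFrame, here is a minimal sketch of the same "n nearly equal parts" split using only iloc slicing. The helper name split_into_n and the toy data are assumptions made for illustration; they are not part of pandas or of the answer above.

```python
import pandas as pd


def split_into_n(df, n):
    """Hypothetical helper: split df row-wise into n nearly equal parts.

    The first (len(df) % n) parts get one extra row, which is the same
    convention np.array_split uses for arrays.
    """
    base, extra = divmod(len(df), n)
    parts = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(df.iloc[start:start + size])
        start += size
    return parts


df = pd.DataFrame({"TEST": range(1, 12)})  # toy data: 11 rows
parts = split_into_n(df, 3)
print([len(p) for p in parts])  # [4, 4, 3]
```

Because it slices by position with iloc, this should not depend on the index labels.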

Answer 2

eli*_*xir 19

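I wanted to do the same thing. I first ran into problems with split, then into problems installing pandas 0.15.2, so I went back to my old version and wrote a small function that works very well. I hope this helps!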

# input - df: a DataFrame, chunkSize: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller DataFrames of max size chunkSize (the last may be smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 10000):
    listOfDf = list()
    # ceiling division, so no empty trailing chunk when len(df) is a multiple of chunkSize
    numberChunks = (len(df) + chunkSize - 1) // chunkSize
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf

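The correct way to compute numberChunks: import math; numberChunks = math.ceil(len(df) / chunkSize) (9 upvotes)
@SergeyLeyko is right. Otherwise, when the size of `df` is divisible by the chunk size, you get an empty DataFrame at the end of the `chunks` list. Here is an alternative for num_chunks: `num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)` (9 upvotes)
Much faster than using np.array_split() (3 upvotes)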
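
The comments above are about how the number of chunks is counted. A quick arithmetic check, written as a sketch with made-up function names, compares a count of `n_rows // chunk_size + 1` with the two alternatives suggested in the comments:

```python
import math


def count_floor_plus_one(n_rows, chunk_size):
    # Always adds one, even when n_rows is an exact multiple of chunk_size.
    return n_rows // chunk_size + 1


def count_ceil(n_rows, chunk_size):
    # Ceiling division via math.ceil, as suggested in the first comment.
    return math.ceil(n_rows / chunk_size)


def count_conditional(n_rows, chunk_size):
    # Adds one only when there is a remainder, as in the second comment.
    return n_rows // chunk_size + (1 if n_rows % chunk_size else 0)


for n_rows in (20000, 423244):
    print(n_rows,
          count_floor_plus_one(n_rows, 10000),
          count_ceil(n_rows, 10000),
          count_conditional(n_rows, 10000))
# 20000 3 2 2
# 423244 43 43 43
```

The three formulas differ only when the row count is an exact multiple of the chunk size. In that case the first one counts one chunk too many, and that extra chunk is empty.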

Answer 3

Ris*_*Vij 11

You can do this in one line with a list comprehension:

n = 4  # number of rows per chunk (not the number of chunks)
chunks = [df[i:i+n] for i in range(0, df.shape[0], n)]
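
Note that n in this one-liner is the chunk length, so the number of chunks depends on the size of the DataFrame. As a small sketch on a made-up 10-row frame (the variable k and the ceiling-division step are my own additions), this is how the same one-liner can be steered towards a target number of chunks:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # toy data: 10 rows

# n rows per chunk: 10 rows give chunks of 4, 4 and 2 rows.
n = 4
chunks = [df[i:i + n] for i in range(0, df.shape[0], n)]
print([len(c) for c in chunks])  # [4, 4, 2]

# To aim for k chunks instead, derive the chunk length by ceiling division.
k = 4
n = -(-len(df) // k)  # ceil(10 / 4) == 3
chunks_k = [df[i:i + n] for i in range(0, df.shape[0], n)]
print([len(c) for c in chunks_k])  # [3, 3, 3, 1]
```

With very small frames this can still produce fewer than k chunks, since every chunk except the last is filled to n rows.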

Answer 4

Gil*_*rto 10

Note that np.array_split(df, 3) splits the DataFrame into 3 sub-DataFrames, whereas splitDataFrameIntoSmaller(df, chunkSize = 3) splits the DataFrame into pieces of chunkSize rows each.

Example:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

You get 3 sub-DataFrames:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With:

df_split2 = splitDataFrameIntoSmaller(df, chunkSize = 3)

You get 4 sub-DataFrames:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

I hope I'm right, and I hope this is useful.

Thanks for the clarification. Take my upvote! (2 upvotes)

Answer 5

yem*_*emu 8

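Warning:

np.array_split does not work with numpy-1.9.0. I checked: it works with 1.8.1.

Error:

DataFrame has no 'size' attribute

I filed a bug on the pandas GitHub: https://github.com/pydata/pandas/issues/8846 It appears to have been fixed in pandas 0.15.2. (6 upvotes)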

Answer 6

pra*_*por 7

I think we can now simply use plain iloc with range.

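import math

# round up, so a remainder does not spill into an extra fifth chunk
chunk_size = math.ceil(df.shape[0] / 4)
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)  # process_data stands in for your own function
    # ....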

Answer 7

rap*_*ael 7

Building on @elixir's answer...
I'd suggest using a generator to avoid holding all the chunks in memory at once:

def chunkit(df, chunk_size = 10000): 
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i*chunk_size:(i + 1) * chunk_size]
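
As a usage sketch, the generator can be consumed in a loop so that only one chunk is referenced at a time. The function is repeated here so the snippet runs on its own; the toy data and the running total are assumptions made for illustration.

```python
import pandas as pd


def chunkit(df, chunk_size=10000):
    # Same generator as above, repeated so this snippet is self-contained.
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i * chunk_size:(i + 1) * chunk_size]


df = pd.DataFrame({"value": range(25)})  # toy data: 25 rows

sizes = []
total = 0
for chunk in chunkit(df, chunk_size=10):
    sizes.append(len(chunk))             # rows in this chunk
    total += int(chunk["value"].sum())   # aggregate per chunk

print(sizes)  # [10, 10, 5]
print(total)  # 300
```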

Answer 8

drk*_*rkr 7

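I like one-liners, so @LucyDrops's answer works for me.

However, there is one important thing: add a .copy() if the chunks should be copies of the corresponding parts of the original df: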

chunks = [df[i:i+n].copy() for i in range(0,df.shape[0],n)]

Otherwise, there is a good chance of getting the following warning when the chunks are processed further (e.g. in a loop):

A value is trying to be set on a copy of a slice from a DataFrame.

(See the pandas documentation for details.)
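
A minimal sketch of what the .copy() buys you, using made-up toy data: each chunk can be modified freely, and the original DataFrame is left untouched.

```python
import pandas as pd

df = pd.DataFrame({"x": range(6)})  # toy data: 6 rows
n = 2                               # rows per chunk

# Each chunk is an independent copy of its slice of df.
chunks = [df[i:i + n].copy() for i in range(0, df.shape[0], n)]

for chunk in chunks:
    # Adding a column to a copy does not touch df.
    chunk["y"] = chunk["x"] * 10

print(list(df.columns))         # ['x']
print(chunks[1]["y"].tolist())  # [20, 30]
```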

Answer 9

rum*_*pel 5

You can use groupby, assuming you have an integer enumerated index:

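import math

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4)

# group rows by their position divided by the subframe length
subframes = [i[1] for i in df.groupby(np.arange(len(df)) // rows_per_subframe)]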

Note: iterating over groupby yields tuples whose second element is the DataFrame, which is why the extraction is slightly more involved.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
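
Since the grouping key here is an array of row positions rather than the index itself, my understanding is that the same approach also works when the index is not an integer range. A small sketch with a made-up string index:

```python
import math

import numpy as np
import pandas as pd

# Toy data with a non-integer index (an assumption for illustration).
df = pd.DataFrame({"sample": np.arange(10)}, index=list("abcdefghij"))

rows_per_subframe = math.ceil(len(df) / 4)             # ceil(10 / 4) == 3
labels = np.arange(len(df)) // rows_per_subframe       # 0,0,0,1,1,1,2,2,2,3

# Unpack the (key, frame) tuples and keep only the frames.
subframes = [frame for _, frame in df.groupby(labels)]

print([len(s) for s in subframes])  # [3, 3, 3, 1]
print(list(subframes[0].index))     # ['a', 'b', 'c']
```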

Archived: 12 years, 7 months ago
Views: 77228
Last activity: 7 years, 10 months ago