Duplicate rows in a DataFrame x times - improving performance

Question


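I am looking for the most efficient way to duplicate the rows of a DataFrame. Each row should be duplicated x times, where x is specific to each row.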

Say this is my given DataFrame:

| id | count |
|----|-------|
| a  | 1     |
| b  | 2     |
| c  | 5     |


The resulting DataFrame should look like this, with each row repeated the number of times given in its "count" column:

| id | count |
|----|-------|
| a  | 1     |
| b  | 2     |
| b  | 2     |
| c  | 5     |
| c  | 5     |
| c  | 5     |
| c  | 5     |
| c  | 5     |


A very basic approach would be to loop over the DataFrame and append each row x times, like this:

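import pandas as pd

data = {'id': ['a', 'b', 'c'], 'count': [1, 2, 5]}
df = pd.DataFrame(data=data)

# Append each row (count - 1) more times. DataFrame.append was removed in
# pandas 2.0, so pd.concat is used here instead.
for index, row in df.iterrows():
    for x in range(row['count'] - 1):
        df = pd.concat([df, df.loc[[index]]], ignore_index=True)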

df = df.sort_values(by=['id'])
df = df.reset_index(drop=True)

df


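While this works for small DataFrames, it is not efficient for large ones with thousands of rows. Since each row must be duplicated up to 200 times, the final DataFrame can contain millions of rows.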

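I have already read about pandas/numpy vectorization, but unfortunately I do not know whether (and how) it can help in this case, since I have to add a lot of rows to the DataFrame.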

Any suggestions on how to improve performance?

Answer 1

jez*_*ael 5

Use Index.repeat if the index values are unique, then pass the result to DataFrame.loc:

df1 = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)
print (df1)
  id  count
0  a      1
1  b      2
2  b      2
3  c      5
4  c      5
5  c      5
6  c      5
7  c      5
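
For reference, the same idea can be written as a self-contained sketch. The helper name `repeat_rows` and its uniqueness check are illustrative assumptions, not part of the answer above; the sketch also assumes the counts are non-negative integers.

```python
import pandas as pd


def repeat_rows(df: pd.DataFrame, count_col: str = "count") -> pd.DataFrame:
    """Repeat each row of df as many times as its value in count_col.

    Hypothetical helper: assumes a unique index and non-negative integer counts.
    """
    if not df.index.is_unique:
        raise ValueError("index must be unique for label-based selection")
    # Index.repeat builds the repeated labels in one vectorized step.
    repeated_labels = df.index.repeat(df[count_col])
    # A single .loc lookup replaces the Python-level append loop.
    return df.loc[repeated_labels].reset_index(drop=True)


df = pd.DataFrame({"id": ["a", "b", "c"], "count": [1, 2, 5]})
result = repeat_rows(df)
print(len(result))  # 8, which equals df["count"].sum()
```

The output length always equals the sum of the count column, which makes for a cheap sanity check before expanding a large DataFrame.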


If the index values may contain duplicates, use numpy.repeat together with DataFrame.iloc:

print (df)
  id  count
0  a      1
1  b      2
1  c      5

import numpy as np

df1 = df.iloc[np.repeat(np.arange(len(df.index)), df['count'])].reset_index(drop=True)
print (df1)
  id  count
0  a      1
1  b      2
2  b      2
3  c      5
4  c      5
5  c      5
6  c      5
7  c      5
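
A self-contained sketch of this positional variant, built on the duplicated index shown above. The variable names are illustrative. Because `iloc` selects by position, repeated labels do not matter here, whereas label-based `loc` can match more than one row for a repeated label.

```python
import numpy as np
import pandas as pd

# Index label 1 appears twice, so a label-based lookup would be ambiguous.
df = pd.DataFrame({"id": ["a", "b", "c"], "count": [1, 2, 5]}, index=[0, 1, 1])

# Positions 0..n-1, each repeated count times: [0, 1, 1, 2, 2, 2, 2, 2]
positions = np.repeat(np.arange(len(df)), df["count"].to_numpy())

# iloc selects by position, so the duplicated labels are irrelevant.
result = df.iloc[positions].reset_index(drop=True)
print(result["id"].tolist())  # ['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']
```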

