How to merge and group by between separate DataFrames

con*_*449 6 python pandas

I have two DataFrames that I would like to merge and group. They are shown below:

df_1


    words  start   stop
0     Oh,   6.72   7.21
1   okay,   7.26   8.01
2      go  12.82  12.90
3  ahead.  12.91  12.94
4     NaN  15.29  15.62
5     NaN  15.63  15.99
6     NaN  16.09  16.36
7     NaN  16.37  16.96
8     NaN  17.88  18.36
9     NaN  18.37  19.36

df_2

data  start  stop
  10    1.0   3.5
  14    4.0   8.5
  11    9.0  13.5
  12   14.0  20.5

I want to merge df_1.words onto df_2, grouping together all the values of df_1.words whose df_1.start falls between df_2.start and df_2.stop. The result should look like this:

df_2

data  start  stop  words
  10    1.0   3.5  NaN
  14    4.0   8.5  Oh, okay,
  11    9.0  13.5  go ahead.
  12   14.0  20.5  NaN, NaN, NaN, NaN, NaN, NaN

ALo*_*llz 1

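If the bin edges do not overlap, as is the case in your example, use pd.cut with an IntervalIndex to group the first DataFrame. This lets the intervals be closed on both edges. Then select from the grouped result with the 'stop' column of df_2 to get the aggregated values in df_2's row order.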

import pandas as pd

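# One interval per row of df_2, closed on both ends
idx = pd.Index([pd.Interval(*x, closed='both') for x in zip(df_2.start, df_2.stop)])

# Bin each df_1.start into those intervals and collect the words per bin;
# observed=False keeps bins that receive no rows
s = df_1.groupby(pd.cut(df_1.start, idx), observed=False).words.agg(list)

# Closed on both, can use `'stop'` to align
df_2['words'] = s[df_2.stop].to_list()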
print(df_2)
   data  start  stop                           words
0    10    1.0   3.5                              []
1    14    4.0   8.5                    [Oh,, okay,]
2    11    9.0  13.5                    [go, ahead.]
3    12   14.0  20.5  [nan, nan, nan, nan, nan, nan]
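
To reproduce this end to end and get one joined string per row (closer to the output the question asks for), something like the following sketch may help. It swaps in `IntervalIndex.get_indexer` in place of `pd.cut`, and it rests on a few assumptions that are not in the original answer: the intervals in df_2 do not overlap, missing words are dropped rather than printed as "NaN", and rows with no usable words are left as NaN. The names `bins`, `pos`, `tmp` and `joined` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the two frames from the question
df_1 = pd.DataFrame({
    "words": ["Oh,", "okay,", "go", "ahead."] + [np.nan] * 6,
    "start": [6.72, 7.26, 12.82, 12.91, 15.29, 15.63, 16.09, 16.37, 17.88, 18.37],
    "stop": [7.21, 8.01, 12.90, 12.94, 15.62, 15.99, 16.36, 16.96, 18.36, 19.36],
})
df_2 = pd.DataFrame({
    "data": [10, 14, 11, 12],
    "start": [1.0, 4.0, 9.0, 14.0],
    "stop": [3.5, 8.5, 13.5, 20.5],
})

# One interval per row of df_2, closed on both ends (assumed non-overlapping)
bins = pd.IntervalIndex.from_arrays(df_2["start"], df_2["stop"], closed="both")

# Position of the df_2 row whose interval contains each df_1.start; -1 if none
pos = bins.get_indexer(df_1["start"].to_numpy())

# Keep only rows that fall in some interval and have an actual word (assumption)
tmp = df_1.assign(bin=pos)
tmp = tmp[(tmp["bin"] >= 0) & tmp["words"].notna()]

# Join the words of each bin into a single space-separated string
joined = tmp.groupby("bin")["words"].agg(" ".join)

# Align back to df_2 by row position; bins without words become NaN
df_2["words"] = joined.reindex(range(len(df_2))).to_numpy()
print(df_2)
```

If the intervals in df_2 can overlap, `get_indexer` raises an error, so a different strategy (for example a cross join followed by a filter) would be needed.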