Efficiently merging overlapping intervals within a single pandas DataFrame that has start and finish columns

Sae*_*isa 12 dataframe python-3.x pandas

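I am merging overlapping intervals in a pandas DataFrame and am looking for an efficient way to do it in pandas, other than the usual algorithm that walks through the rows one at a time. How can I do this in pandas?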

I have tried the usual overlap algorithm, which runs over every row and checks whether the current row.start < last_end. That works for me.
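For reference, the row-by-row approach described above might look roughly like the sketch below. It is not the asker's actual code: the function name `merge_rowwise` is hypothetical, and it assumes `<=` rather than `<` so that touching intervals (a FINISH of 10 followed by a START of 10) are merged, which the expected output in this question seems to require.

```python
import pandas as pd


def merge_rowwise(df: pd.DataFrame) -> pd.DataFrame:
    """Merge overlapping [START, FINISH] intervals with an explicit loop."""
    merged = []  # list of [start, finish] pairs, built in START order
    ordered = df.sort_values("START")[["START", "FINISH"]]
    for start, finish in ordered.itertuples(index=False):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], finish)
        else:
            # Gap before this row: start a new merged interval.
            merged.append([start, finish])
    return pd.DataFrame(merged, columns=["START", "FINISH"])


df = pd.DataFrame({
    "START": [0.0, 10.0, 0.0, 10.0, 9850.123],
    "FINISH": [10.0, 8700.182997, 10.0, 9720.687227, 9990.0],
})
baseline = merge_rowwise(df)
print(baseline)
```

A loop like this is easy to verify by hand, so it can serve as a baseline for checking any vectorized version.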

Suppose I have the following input DataFrame:

df:
    START   FINISH
0   0.000000    10.000000
1   10.000000   8700.182997
2   0.000000    10.000000
3   10.000000   9720.687227
4   9850.123    9990.000000

I would like the output to look like this:

df:
    START   FINISH
0   0.000000    9720.687227
2   9850.123    9990.000000

Thanks in advance!

Dev*_*dka 13

You can do this with pandas alone:

import io

import pandas as pd

## load data

raw = """START,FINISH
0.000000    ,10.000000
10.000000   ,4500.182997
5000.00    ,7000.000000
6000   ,8500.687227
9850.123,9990.000000
"""

buf = io.StringIO(raw)
df = pd.read_csv(buf)

## solution

df.sort_values("START", inplace=True)

## Compare START of each row with FINISH of the previous row
## (shift() moves FINISH down by one row). The expression before cumsum
## is True where the interval breaks (i.e. cannot be merged with the row
## above), so cumsum increments the group label at every break
## (cumsum treats True as 1 and False as 0).
df["group"] = (df["START"] > df["FINISH"].shift()).cumsum()

## Take the min of "START" and the max of "FINISH" within each group.
result = df.groupby("group").agg({"START": "min", "FINISH": "max"})
print(result)  # print() also works outside a notebook, unlike display()

Output:

          START       FINISH
group                       
0         0.000  4500.182997
1      5000.000  8500.687227
2      9850.123  9990.000000
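To make the shift-and-cumsum step easier to follow, the intermediate values can be written out as separate columns. This is only an illustrative sketch on the same sample data; the column names `prev_finish` and `breaks` are made up here and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "START": [0.0, 10.0, 5000.0, 6000.0, 9850.123],
    "FINISH": [10.0, 4500.182997, 7000.0, 8500.687227, 9990.0],
}).sort_values("START")

trace = df.copy()
# FINISH of the row above; NaN for the first row, which has no predecessor.
trace["prev_finish"] = trace["FINISH"].shift()
# True where a row starts after the previous row has finished (a gap).
# Comparing with NaN yields False, so the first row never opens a new group.
trace["breaks"] = trace["START"] > trace["prev_finish"]
# Running count of gaps seen so far, used as the group label.
trace["group"] = trace["breaks"].cumsum()
print(trace)
```

Rows that share a group label are the ones that end up merged into a single interval by the groupby step.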


小智 9

The answer above is inspiring, but a couple of points need attention.

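(1) The direction of shift() deserves care: with its default periods=1, shift() moves FINISH down by one row, so each row is compared with the FINISH of the row above it. (2) The answer does not handle a row that lies entirely inside the bounds of an earlier record. Adding cummax() fixes that.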

Here is the modified code:

import io

import pandas as pd

## load data

raw = """START,FINISH
0.000000    ,10.000000
2.000000    ,3.000000
10.000000   ,4500.182997
5000.00    ,7000.000000
6000   ,8500.687227
9850.123,9990.000000
"""

buf = io.StringIO(raw)
df = pd.read_csv(buf)

## solution

df.sort_values("START", inplace=True)

## Compare START of the present row with the largest FINISH among all
## previous rows (shift() moves FINISH down by one row, cummax() keeps the
## running maximum). The expression before cumsum is True where the interval
## breaks (i.e. cannot be merged), so cumsum increments the group label at
## every break (cumsum treats True as 1 and False as 0).
df["group"] = (df["START"] > df["FINISH"].shift().cummax()).cumsum()

## Inspect the group labels assigned to each row.
print(df)

## Take the min of "START" and the max of "FINISH" within each group.
result = df.groupby("group").agg({"START": "min", "FINISH": "max"})
print(result)

Output (of print(result)):

          START       FINISH
group                       
0         0.000  4500.182997
1      5000.000  8500.687227
2      9850.123  9990.000000

Result of the unmodified solution on the same data:

          START       FINISH
group                       
0         0.000    10.000000
1        10.000  4500.182997
2      5000.000  8500.687227
3      9850.123  9990.000000
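As a possible way to reuse the cummax() variant, the steps can be wrapped in a small function that leaves the input DataFrame untouched. This is a sketch under assumptions: the function name `merge_intervals` and its `start`/`finish` parameters are hypothetical, and it is run here on the data from the question rather than on the sample used in the answers.

```python
import pandas as pd


def merge_intervals(df, start="START", finish="FINISH"):
    """Merge overlapping or touching intervals; returns a new DataFrame."""
    ordered = df.sort_values(start)  # sorted copy, df itself is not modified
    # Largest FINISH seen in any earlier row (NaN for the first row).
    prev_max_finish = ordered[finish].shift().cummax()
    # A new group starts wherever a row begins after everything before it ended.
    group = (ordered[start] > prev_max_finish).cumsum().rename("group")
    return (
        ordered.groupby(group)
        .agg({start: "min", finish: "max"})
        .reset_index(drop=True)
    )


df = pd.DataFrame({
    "START": [0.0, 10.0, 0.0, 10.0, 9850.123],
    "FINISH": [10.0, 8700.182997, 10.0, 9720.687227, 9990.0],
})
merged = merge_intervals(df)
print(merged)
```

Because the comparison uses `>`, intervals that merely touch (FINISH equal to the next START) fall into the same group, which matches the output the question asks for.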