Cai*_*tru 4 python group-by dataframe pandas
I have the following DataFrame:
import numpy as np
import pandas as pd
df = {'ID': ['1','1','2', '2', '3', '3', '4', '4', '4'],
'USER' : ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C'],
'DATE_VIEW': ['16/05/2019','18/05/2019', '16/03/2020', '18/03/2020', '16/07/2020', '21/07/2020', '13/02/2020', '14/02/2020', '15/02/2020'],
'DATE_ACCEPT': ['17/05/2019', np.nan, np.nan, '18/03/2020', '16/07/2020', np.nan, np.nan, '14/02/2020', np.nan],
}
df = pd.DataFrame(df)
df['DATE_VIEW'] = pd.to_datetime(df['DATE_VIEW'], format = '%d/%m/%Y')
df['DATE_ACCEPT'] = pd.to_datetime(df['DATE_ACCEPT'], format = '%d/%m/%Y')
df
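For each unique df['ID'], I am looking for a way to keep a row if its df['DATE_VIEW'] is earlier than the df['DATE_VIEW'] of the row where df['DATE_ACCEPT'] has been filled, and to drop the row if its df['DATE_VIEW'] is later than the df['DATE_VIEW'] of the row where df['DATE_ACCEPT'] has been filled for that particular df['ID']. The expected output is as follows: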
You can groupby on the ID column and use transform to get the max DATE_ACCEPT for each row, then compare DATE_VIEW with that date:
df.loc[df['DATE_VIEW'].le(df.groupby('ID')['DATE_ACCEPT'].transform('max'))]
Output:
  ID USER  DATE_VIEW DATE_ACCEPT
0  1    A 2019-05-16  2019-05-17
2  2    A 2020-03-16         NaT
3  2    B 2020-03-18  2020-03-18
4  3    A 2020-07-16  2020-07-16
6  4    A 2020-02-13         NaT
7  4    B 2020-02-14  2020-02-14
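To see why the one-liner works, it can help to split it into its intermediate steps. The following is a minimal sketch using the question's data; the names accept_max, keep and result are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '2', '2', '3', '3', '4', '4', '4'],
    'USER': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C'],
    'DATE_VIEW': ['16/05/2019', '18/05/2019', '16/03/2020', '18/03/2020',
                  '16/07/2020', '21/07/2020', '13/02/2020', '14/02/2020',
                  '15/02/2020'],
    'DATE_ACCEPT': ['17/05/2019', np.nan, np.nan, '18/03/2020', '16/07/2020',
                    np.nan, np.nan, '14/02/2020', np.nan],
})
for col in ('DATE_VIEW', 'DATE_ACCEPT'):
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

# Latest DATE_ACCEPT per ID, broadcast back onto every row of that ID
# (transform returns a Series aligned with df, not one row per group).
accept_max = df.groupby('ID')['DATE_ACCEPT'].transform('max')

# True where the row was viewed on or before that accepted date.
keep = df['DATE_VIEW'].le(accept_max)

result = df.loc[keep]
print(result)
```

Because transform returns a Series with the same index as df, the comparison is row-by-row and the mask can be passed straight to df.loc.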
P.S. You can of course chain reset_index(drop=True) afterwards if you want it to look exactly like your expected output.
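As a rough illustration of what reset_index(drop=True) changes, here is a small sketch on a stand-in frame (three made-up rows, not the question's data). The names mask, filtered and renumbered are hypothetical.

```python
import pandas as pd

# Small stand-in frame (not the question's data), just to show the index effect.
df = pd.DataFrame({
    'ID': ['1', '1', '2'],
    'DATE_VIEW': pd.to_datetime(['2019-05-16', '2019-05-18', '2020-03-16']),
    'DATE_ACCEPT': pd.to_datetime(['2019-05-17', None, '2020-03-16']),
})

mask = df['DATE_VIEW'].le(df.groupby('ID')['DATE_ACCEPT'].transform('max'))

filtered = df.loc[mask]                       # keeps the original labels, with gaps
renumbered = filtered.reset_index(drop=True)  # fresh 0..n-1 labels, old index discarded

print(list(filtered.index))
print(list(renumbered.index))
```

Without drop=True the old index would be kept as an extra column instead of being discarded.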
Update: if you want to keep the rows where both dates are np.nan, you can add another boolean mask and combine the two with |:
# the original condition DATE_VIEW <= DATE_ACCEPT
m1 = df['DATE_VIEW'].le(df.groupby('ID')['DATE_ACCEPT'].transform('max'))
# both dates are np.nan
m2 = df[['DATE_VIEW', 'DATE_ACCEPT']].isna().all(axis=1)
df.loc[m1|m2]
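One thing worth knowing when combining masks like this: a comparison against NaT evaluates to False, so an ID that has no DATE_ACCEPT at all is dropped entirely by the first condition. If the intent is to keep such IDs (this is an assumption, the answer above does not say so), a mask on the group-level maximum may be what you need. A minimal sketch with made-up rows; the ID '5' and the mask name m3 are hypothetical:

```python
import pandas as pd

# Stand-in data: ID '5' has no DATE_ACCEPT at all (hypothetical rows).
df = pd.DataFrame({
    'ID': ['1', '1', '5', '5'],
    'DATE_VIEW': pd.to_datetime(['2019-05-16', '2019-05-18',
                                 '2020-01-01', '2020-01-02']),
    'DATE_ACCEPT': pd.to_datetime(['2019-05-17', None, None, None]),
})

accept_max = df.groupby('ID')['DATE_ACCEPT'].transform('max')

# Original condition; rows of ID '5' compare against NaT and come out False.
m1 = df['DATE_VIEW'].le(accept_max)
# Hypothetical extra mask: True for every row of an ID that was never accepted.
m3 = accept_max.isna()

only_m1 = df.loc[m1]
with_m3 = df.loc[m1 | m3]

print(only_m1)
print(with_m3)
```

Here only_m1 loses both rows of ID '5', while with_m3 keeps them alongside the rows that pass the original condition.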