How to keep only the fastest rides in pandas

Mat*_* P. 6 python performance dataframe python-3.x pandas

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({
    'origin': ['town a', 'town a', 'town a', 'town a', 'town c', 'town c'],
    'destination': ['town b', 'town b', 'town b', 'town b', 'town b', 'town b'],
    'departure_hour': ['09:30', '09:45', '10:00', '10:30', '14:30', '15:30'],
    'arrival_hour': ['11:30', '10:50', '12:00', '11:45', '16:30', '19:30'],
    'date': ['29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020']})

   origin destination departure_hour arrival_hour        date
0  town a      town b          09:30        11:30  29-09-2020
1  town a      town b          09:45        10:50  29-09-2020
2  town a      town b          10:00        12:00  29-09-2020
3  town a      town b          10:30        11:45  29-09-2020
4  town c      town b          14:30        16:30  29-09-2020
5  town c      town b          15:30        19:30  29-09-2020

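These are rides between different towns, with departure and arrival times. I want to delete every row (ride) for which another ride on the same route and date departs later and arrives earlier.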

So I want to get this result:

   origin destination departure_hour arrival_hour        date
1  town a      town b          09:45        10:50  29-09-2020
3  town a      town b          10:30        11:45  29-09-2020
4  town c      town b          14:30        16:30  29-09-2020
5  town c      town b          15:30        19:30  29-09-2020

I can do this with the following approach:

df['count_utility']=df.apply(lambda x : sum((df['departure_hour']>x.departure_hour)&(df['arrival_hour']<x.arrival_hour)&(df['origin']==x.origin)&(df['destination']==x.destination)&(df['date']==x.date)),axis=1)

Then I apply the filter: df['count_utility']==0
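Put together on the sample data, the two steps might look like the sketch below. The names `count_better` and `fastest` are introduced here for illustration only; the logic is the same row-by-row comparison as in the one-liner above, followed by the boolean filter.

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['town a', 'town a', 'town a', 'town a', 'town c', 'town c'],
    'destination': ['town b'] * 6,
    'departure_hour': ['09:30', '09:45', '10:00', '10:30', '14:30', '15:30'],
    'arrival_hour': ['11:30', '10:50', '12:00', '11:45', '16:30', '19:30'],
    'date': ['29-09-2020'] * 6,
})

def count_better(row):
    """Count rides on the same route and date that leave later and arrive earlier."""
    better = (
        (df['departure_hour'] > row['departure_hour'])
        & (df['arrival_hour'] < row['arrival_hour'])
        & (df['origin'] == row['origin'])
        & (df['destination'] == row['destination'])
        & (df['date'] == row['date'])
    )
    return int(better.sum())

# One full scan of the frame per row: simple, but quadratic in the number of rows.
df['count_utility'] = df.apply(count_better, axis=1)

# Keep only the rides that no other ride beats, then drop the helper column.
fastest = df[df['count_utility'] == 0].drop(columns='count_utility')
print(fastest)
```

The zero-padded 'HH:MM' strings compare correctly as plain strings, which is why no time parsing is needed here.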

But this approach is too slow for my DataFrame, which has 1 million rows.

I think grouping by origin, destination and date could be faster, but I don't know how to do it.

jez*_*ael 4

One idea is to use NumPy broadcasting inside a custom function applied to each group with GroupBy.apply:

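def f(x):
    a = x['departure_hour'].to_numpy()
    b = x['arrival_hour'].to_numpy()
    # m[i, j] is True when ride j departs later and arrives earlier than ride i
    m = (a > a[:, None]) & (b < b[:, None])
    # number of better rides for each row of the group
    x['count_utility']  = m.sum(axis=1)
    return x

# group_keys=False keeps the original row index (pandas >= 2.0 would otherwise prepend the group keys)
df = df.groupby(['origin','destination','date'], group_keys=False).apply(f)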
print (df)
   origin destination departure_hour arrival_hour        date  count_utility
0  town a      town b          09:30        11:30  29-09-2020              1
1  town a      town b          09:45        10:50  29-09-2020              0
2  town a      town b          10:00        12:00  29-09-2020              1
3  town a      town b          10:30        11:45  29-09-2020              0
4  town c      town b          14:30        16:30  29-09-2020              0
5  town c      town b          15:30        19:30  29-09-2020              0
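To get the result asked for in the question, the rows with a zero count still have to be selected. The sketch below does that, with one deliberate difference from the code above: it loops over the positional indices of each group (`GroupBy.indices`) instead of calling `GroupBy.apply`, so it does not depend on how `apply` reassembles its output, which has changed between pandas releases. The names `dep`, `arr`, `counts` and `fastest` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'origin': ['town a', 'town a', 'town a', 'town a', 'town c', 'town c'],
    'destination': ['town b'] * 6,
    'departure_hour': ['09:30', '09:45', '10:00', '10:30', '14:30', '15:30'],
    'arrival_hour': ['11:30', '10:50', '12:00', '11:45', '16:30', '19:30'],
    'date': ['29-09-2020'] * 6,
})

dep = df['departure_hour'].to_numpy()
arr = df['arrival_hour'].to_numpy()
counts = np.zeros(len(df), dtype=int)

# .indices maps each (origin, destination, date) key to the row positions of that group.
for pos in df.groupby(['origin', 'destination', 'date']).indices.values():
    a, b = dep[pos], arr[pos]
    # better[i, j] is True when ride j departs later and arrives earlier than ride i.
    better = (a > a[:, None]) & (b < b[:, None])
    counts[pos] = better.sum(axis=1)

# Keep the rides that no other ride on the same route and date beats.
fastest = df[counts == 0]
print(fastest)
```

Note that the broadcast comparison builds an n x n boolean matrix for a group of n rides, so memory grows quadratically with the size of the largest group; this is fine for many small groups but may need a different strategy if a single route and date holds tens of thousands of rides.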