Mat*_* P. 6 python performance dataframe python-3.x pandas
I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({
    'origin': ['town a', 'town a', 'town a', 'town a', 'town c', 'town c'],
    'destination': ['town b', 'town b', 'town b', 'town b', 'town b', 'town b'],
    'departure_hour': ['09:30', '09:45', '10:00', '10:30', '14:30', '15:30'],
    'arrival_hour': ['11:30', '10:50', '12:00', '11:45', '16:30', '19:30'],
    'date': ['29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020', '29-09-2020'],
})
   origin  destination  departure_hour  arrival_hour        date
0  town a       town b           09:30         11:30  29-09-2020
1  town a       town b           09:45         10:50  29-09-2020
2  town a       town b           10:00         12:00  29-09-2020
3  town a       town b           10:30         11:45  29-09-2020
4  town c       town b           14:30         16:30  29-09-2020
5  town c       town b           15:30         19:30  29-09-2020
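These are bus trips between different towns, each with a departure and an arrival time. I want to delete every row (trip) for which there is another trip on the same route and date that departs later and still arrives earlier.
So I would like to get this result: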
   origin  destination  departure_hour  arrival_hour        date
1  town a       town b           09:45         10:50  29-09-2020
3  town a       town b           10:30         11:45  29-09-2020
4  town c       town b           14:30         16:30  29-09-2020
5  town c       town b           15:30         19:30  29-09-2020
I can do this with the following approach:
df['count_utility'] = df.apply(
    lambda x: sum(
        (df['departure_hour'] > x.departure_hour)
        & (df['arrival_hour'] < x.arrival_hour)
        & (df['origin'] == x.origin)
        & (df['destination'] == x.destination)
        & (df['date'] == x.date)
    ),
    axis=1,
)
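Then I apply the filter: df['count_utility']==0
But this approach is far too slow for my DataFrame of 1 million rows, because every row is compared against the whole DataFrame.
I think grouping by origin, destination and date might be faster, but I don't know how to do it.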
One idea is to use numpy broadcasting inside a custom function applied per group with GroupBy.apply:
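def f(x):
    x = x.copy()  # work on a copy so the group passed in is not mutated
    a = x['departure_hour'].to_numpy()
    b = x['arrival_hour'].to_numpy()
    # m[i, j] is True when trip j departs later and arrives earlier than trip i
    m = (a > a[:, None]) & (b < b[:, None])
    x['count_utility'] = m.sum(axis=1)
    return x

# group_keys=False prevents the group keys from being added to the index
df = df.groupby(['origin','destination','date'], group_keys=False).apply(f)
print (df)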
   origin  destination  departure_hour  arrival_hour        date  count_utility
0  town a       town b           09:30         11:30  29-09-2020              1
1  town a       town b           09:45         10:50  29-09-2020              0
2  town a       town b           10:00         12:00  29-09-2020              1
3  town a       town b           10:30         11:45  29-09-2020              0
4  town c       town b           14:30         16:30  29-09-2020              0
5  town c       town b           15:30         19:30  29-09-2020              0
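To get from count_utility to the result asked for in the question, the rows with a count of 0 still have to be selected. Below is a minimal end-to-end sketch of that, not the answer's exact code: it uses the same broadcasting comparison, but loops over GroupBy.indices instead of GroupBy.apply so that it does not rely on how a particular pandas version treats the grouping columns inside apply. The names keys, counts and result are my own. It also assumes the times are zero-padded 'HH:MM' strings, which compare in chronological order as plain strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'origin': ['town a', 'town a', 'town a', 'town a', 'town c', 'town c'],
    'destination': ['town b'] * 6,
    'departure_hour': ['09:30', '09:45', '10:00', '10:30', '14:30', '15:30'],
    'arrival_hour': ['11:30', '10:50', '12:00', '11:45', '16:30', '19:30'],
    'date': ['29-09-2020'] * 6,
})

keys = ['origin', 'destination', 'date']   # hypothetical helper name
dep_all = df['departure_hour'].to_numpy()
arr_all = df['arrival_hour'].to_numpy()

# counts[i] = number of trips in the same group that leave later
# and arrive earlier than trip i
counts = np.zeros(len(df), dtype=int)

# GroupBy.indices maps each group key to the integer positions of its rows
for pos in df.groupby(keys).indices.values():
    dep = dep_all[pos]
    arr = arr_all[pos]
    # better[i, j] is True when trip j departs later and arrives earlier than trip i
    better = (dep > dep[:, None]) & (arr < arr[:, None])
    counts[pos] = better.sum(axis=1)

# keep only the trips that no other trip in the group beats
result = df[counts == 0]
print(result)
```

Note that the comparison builds an n x n boolean matrix per group, so memory use grows with the square of the largest group's size. With 1 million rows that is fine as long as no single (origin, destination, date) group is very large.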