注意:寻找一些有效方法的帮助,除了大型连接,然后计算日期之间的差异
我有table1国家ID和日期(没有这些值的重复),我想总结table2信息(其中包含country,date,cluster_x和count变量,其中cluster_x是cluster_1,cluster_2,cluster_3),因此table1每个都附加了集群ID的值以及table2日期table2之前30天内发生日期的汇总计数table1.
我相信这在SQL中很简单:如何在Pandas中做到这一点?
select a.date,a.country,
sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
on a.country=b.country
group by a.date,a.country
Run Code Online (Sandbox Code Playgroud)
编辑:
这是一个有点改变的例子.假设这是table1,一个包含日期,城市,集群和计数的聚合数据集.下面是"查询"数据集(table2).在这种情况下,只要table1中的日期字段在30天之前,我们希望将table1的count字段与cluster1,cluster2,cluster3(实际上有100个)对应于country id的count字段相加.
因此,例如,查询数据集的第一行具有日期2/2/2015和国家1.在表1中,在30天之前只有一行,而对于具有计数2的集群2.
以下是CSV中两个表的转储:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
Run Code Online (Sandbox Code Playgroud)
和table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Run Code Online (Sandbox Code Playgroud)
编辑:哎呀 - 希望我能在提交之前看到有关加入的编辑。Np,我会留下这个,因为这是有趣的练习。欢迎批评指正。
如果 table1 和 table2 与此脚本位于同一目录中的“table1.csv”和“table2.csv”,则这应该可以工作。
我在 30 天的时间内没有得到与你的示例相同的结果 - 不得不将其增加到 31 天,但我认为精神就在这里:
import pandas as pd
import numpy as np
table1_path = './table1.csv'
table2_path = './table2.csv'
with open(table1_path) as f:
table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)
with open(table2_path) as f:
table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)
joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
Run Code Online (Sandbox Code Playgroud)
我的测试输出:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
Run Code Online (Sandbox Code Playgroud)
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
Run Code Online (Sandbox Code Playgroud)