Posted by Shi*_*nar

Unable to understand how aggregateByKey and combineByKey work

I am a beginner learning Apache Spark. Currently, I am trying to learn various aggregations in Python.

To give some context for the problem I am facing: I am finding it hard to understand how aggregateByKey works when computing the number of orders per "status".

I am following ITVersity's YouTube playlist; below is the code I am using, along with some sample output.

# Each line: order_id,order_date,customer_id,order_status
ordersRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
for i in ordersRDD.take(10): print(i)

Output:
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT

# Key each record by its order status (the 4th comma-separated field)
ordersMap = ordersRDD.map(lambda x: (x.split(",")[3], x))

Output:
(u'CLOSED', u'1,2013-07-25 00:00:00.0,11599,CLOSED')
(u'PENDING_PAYMENT', u'2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT')
(u'COMPLETE', u'3,2013-07-25 00:00:00.0,12111,COMPLETE')
(u'CLOSED', u'4,2013-07-25 00:00:00.0,8827,CLOSED')
(u'COMPLETE', u'5,2013-07-25 00:00:00.0,11318,COMPLETE')
(u'COMPLETE', u'6,2013-07-25 00:00:00.0,7130,COMPLETE')
(u'COMPLETE', u'7,2013-07-25 00:00:00.0,4530,COMPLETE')
(u'PROCESSING', u'8,2013-07-25 00:00:00.0,2911,PROCESSING')
(u'PENDING_PAYMENT', u'9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT')
(u'PENDING_PAYMENT', u'10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT')
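
To check my understanding of aggregateByKey's three arguments before applying it to the full data set, I put together this tiny standalone example (my own sketch, not code from the video; it reuses the existing sc and hardcodes the first few rows from above):

rows = ["1,2013-07-25 00:00:00.0,11599,CLOSED",
        "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
        "3,2013-07-25 00:00:00.0,12111,COMPLETE",
        "4,2013-07-25 00:00:00.0,8827,CLOSED"]
pairs = sc.parallelize(rows).map(lambda x: (x.split(",")[3], x))
counts = pairs.aggregateByKey(
    0,                                # zeroValue: each key's count starts at 0
    lambda acc, val: acc + 1,         # seqFunc: add 1 for every record seen within a partition
    lambda acc1, acc2: acc1 + acc2)   # combFunc: merge the per-partition counts for the same key
print(counts.collect())               # e.g. [(u'CLOSED', 2), (u'PENDING_PAYMENT', 1), (u'COMPLETE', 1)]

Applying the same idea to the full ordersMap: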

ordersByStatus = ordersMap.aggregateByKey(0, lambda acc, val: acc + 1, lambda acc, val: acc + val)
for i in ordersByStatus.collect(): print(i)
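For comparison (again, my own sketch rather than code from the video), I believe the same count can be expressed with combineByKey, which takes a createCombiner, a mergeValue, and a mergeCombiners function instead of a zero value:

ordersByStatusCBK = ordersMap.combineByKey(
    lambda val: 1,                    # createCombiner: the first record for a key in a partition becomes a count of 1
    lambda acc, val: acc + 1,         # mergeValue: every further record in that partition adds 1
    lambda acc1, acc2: acc1 + acc2)   # mergeCombiners: sum the partial counts across partitions
for i in ordersByStatusCBK.collect(): print(i)

Is this understanding of how the two functions relate correct, and how exactly does aggregateByKey arrive at the count per status?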

apache-spark pyspark

