arm*_*ong 5 scala apache-spark
I have RDD items such as:
(3922774869,10,1)
(3922774869,11,1)
(3922774869,12,2)
(3922774869,13,2)
(1779744180,10,1)
(1779744180,11,1)
(3922774869,14,3)
(3922774869,15,2)
(1779744180,16,1)
(3922774869,12,1)
(3922774869,13,1)
(1779744180,14,1)
(1779744180,15,1)
(1779744180,16,1)
(3922774869,14,2)
(3922774869,15,1)
(1779744180,16,1)
(1779744180,17,1)
(3922774869,16,4)
...
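These represent (id, age, count).
I want to group these rows to produce a dataset in which each row holds the age distribution for one id, as shown below (each (id, age) pair is unique within a row):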
(1779744180, (10,1), (11,1), (12,2), (13,2) ...)
(3922774869, (10,1), (11,1), (12,3), (13,4) ...)
That is, (id, (age, count), (age, count) ...)
Could someone give me a clue?
You can first reduce by both fields (id and age) to sum the counts, then re-key by id and use groupByKey:
rdd
.map { case (id, age, count) => ((id, age), count) }.reduceByKey(_ + _)
.map { case ((id, age), count) => (id, (age, count)) }.groupByKey()
This returns an RDD[(Long, Iterable[(Int, Int)])], which for the input above would contain these two records:
(1779744180,CompactBuffer((16,3), (15,1), (14,1), (11,1), (10,1), (17,1)))
(3922774869,CompactBuffer((11,1), (12,3), (16,4), (13,3), (15,3), (10,1), (14,5)))
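Note that the pairs inside each CompactBuffer come back in no particular order. If a distribution sorted by age is wanted, one option is to sort each group after grouping. The following is a minimal sketch of that idea, assuming a local SparkContext; the object name `AgeDistribution` and the helper `ageDistribution` are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object, used here only to make the sketch self-contained.
object AgeDistribution {

  // Sum counts per (id, age), regroup by id, then sort each id's pairs by age.
  def ageDistribution(rdd: RDD[(Long, Int, Int)]): RDD[(Long, Seq[(Int, Int)])] =
    rdd
      .map { case (id, age, count) => ((id, age), count) }
      .reduceByKey(_ + _)
      .map { case ((id, age), count) => (id, (age, count)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._1)) // groupByKey does not guarantee order

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("age-distribution").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The sample rows from the question; ids need the L suffix because they exceed Int range.
      val rdd = sc.parallelize(Seq(
        (3922774869L, 10, 1), (3922774869L, 11, 1), (3922774869L, 12, 2),
        (3922774869L, 13, 2), (1779744180L, 10, 1), (1779744180L, 11, 1),
        (3922774869L, 14, 3), (3922774869L, 15, 2), (1779744180L, 16, 1),
        (3922774869L, 12, 1), (3922774869L, 13, 1), (1779744180L, 14, 1),
        (1779744180L, 15, 1), (1779744180L, 16, 1), (3922774869L, 14, 2),
        (3922774869L, 15, 1), (1779744180L, 16, 1), (1779744180L, 17, 1),
        (3922774869L, 16, 4)
      ))
      ageDistribution(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample input, id 1779744180 should end up with the pairs (10,1), (11,1), (14,1), (15,1), (16,3), (17,1) and id 3922774869 with (10,1), (11,1), (12,3), (13,3), (14,5), (15,3), (16,4), the same values as above but ordered by age. Sorting happens per group in memory, so this sketch assumes each id has a modest number of distinct ages.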