anv*_*esh 4 python apache-spark pyspark
I have two RDDs:
rdd1 = sc.parallelize([("www.page1.html", "word1"), ("www.page2.html", "word1"),
("www.page1.html", "word3")])
rdd2 = sc.parallelize([("www.page1.html", 7.3), ("www.page2.html", 1.25),
("www.page3.html", 5.41)])
intersection_rdd = rdd1.keys().intersection(rdd2.keys())
When I do this, I only get the intersecting keys (www.page1.html, www.page2.html).
But I need the keys together with the values from both RDDs. The output should look like this:
[www.page1.html, (word1, word3, 7.3)]
[www.page2.html, (word1, 1.25)]
You can, for example, use cogroup and then filter:
## This relies on an empty pyspark.resultiterable.ResultIterable
## being falsy, so keys missing from either RDD are dropped
intersection_rdd = rdd1.cogroup(rdd2).filter(lambda x: x[1][0] and x[1][1])
intersection_rdd.map(lambda x: (x[0], (list(x[1][0]), list(x[1][1])))).collect()
## [('www.page1.html', (['word1', 'word3'], [7.3])),
## ('www.page2.html', (['word1'], [1.25]))]
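To make the cogroup-and-filter logic easier to follow without a Spark context, here is a minimal plain-Python sketch of the same idea. It is only an emulation: `cogroup_intersection`, `pairs1`, and `pairs2` are hypothetical names introduced for illustration, not part of PySpark. It also flattens the two value groups into one tuple, which is the shape the question asked for.

```python
from collections import defaultdict


def cogroup_intersection(left, right):
    """Plain-Python stand-in for cogroup + filter (hypothetical helper, not a PySpark API)."""
    # For every key, collect the values from each side separately,
    # similar to the pair of iterables that cogroup yields per key.
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)

    # Keep only keys with at least one value on both sides,
    # then concatenate the two groups into a single flat tuple.
    return {
        key: tuple(left_vals) + tuple(right_vals)
        for key, (left_vals, right_vals) in grouped.items()
        if left_vals and right_vals
    }


# Same sample data as in the question, as ordinary lists of pairs.
pairs1 = [("www.page1.html", "word1"), ("www.page2.html", "word1"),
          ("www.page1.html", "word3")]
pairs2 = [("www.page1.html", 7.3), ("www.page2.html", 1.25),
          ("www.page3.html", 5.41)]

result = cogroup_intersection(pairs1, pairs2)
print(result)
```

On the RDD itself, the same flattening could presumably be done by following the filter with something like `mapValues(lambda v: tuple(v[0]) + tuple(v[1]))`; the order of values within each group is not guaranteed by Spark.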