How to combine two RDDs into one RDD in Spark (Python)

曾逸飞*_*曾逸飞 2 apache-spark rdd pyspark

For example, I have two RDDs: rdd1 = [[1,2],[3,4]] and rdd2 = [[5,6],[7,8]]. How can I combine them into this form: [[1,2,5,6],[3,4,7,8]]? Is there a function that solves this?

Moh*_*hif 5

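You can combine the two RDDs with rdd.zip(), which pairs their elements by position, and then map over the zipped RDD to flatten each pair into the desired list. Note that zip() requires both RDDs to have the same number of partitions and the same number of elements in each partition.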

from pyspark import SparkContext

# Reuse the active SparkContext (already available as `sc` in the pyspark shell)
sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([[1, 2], [3, 4]])
rdd2 = sc.parallelize([[5, 6], [7, 8]])

# Zip the two RDDs: each element becomes a pair such as ([1, 2], [5, 6])
rdd_temp = rdd1.zip(rdd2)

# Flatten each pair into a single list
# Reference : /sf/ask/66704011/
rdd_final = rdd_temp.map(lambda x: [item for sublist in x for item in sublist])

print(rdd_final.collect())
# Output : [[1, 2, 5, 6], [3, 4, 7, 8]]

You can also check out the results on the Databricks notebook at this link.
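
If the nested list comprehension in the map step is hard to read, the flattening can be pulled out into a named helper. The following is only a sketch that runs without Spark: `flatten_pair` is a hypothetical name, and plain Python lists with the built-in zip stand in for the RDD contents and for rdd1.zip(rdd2).

```python
from itertools import chain


def flatten_pair(pair):
    """Concatenate the lists inside one zipped element into a single flat list."""
    return list(chain.from_iterable(pair))


# Local stand-ins for the RDD contents (assumption: no Spark session is used here).
rows1 = [[1, 2], [3, 4]]
rows2 = [[5, 6], [7, 8]]

# The built-in zip pairs elements by position, much like rdd1.zip(rdd2) does.
combined = [flatten_pair(pair) for pair in zip(rows1, rows2)]
print(combined)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

On the actual RDDs, the same helper could presumably be passed straight to the map step, as in rdd_temp.map(flatten_pair), in place of the lambda.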