PySpark - Convert an RDD into a key-value pair RDD, with the values in a list

nik*_*kos 3 key-value apache-spark rdd pyspark

I have an RDD of tuples of the following form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to transform it into a key-value pair RDD, where the first field is the key (the first string) and the second field is a list of the remaining strings (the value), i.e. I want to turn it into the following form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...

B.M*_*.W. 8

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])

>>> result = rdd.map(lambda x: (x[0], list(x[1:])))

>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

Explanation of `lambda x: (x[0], list(x[1:]))`:

  1. `x[0]` takes the first element of the tuple and makes it the first element (the key) of the output
  2. `x[1:]` gathers every element except the first into the second field of the output
  3. `list(x[1:])` forces that second field to be a list; slicing a tuple would otherwise yield a tuple
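Since `rdd.map` simply applies the function to each element, the lambda's behavior can be checked with plain Python, without a Spark cluster. Below is a minimal sketch; `to_pair` is a hypothetical helper name introduced here for illustration:

```python
# to_pair is a hypothetical name for the same function passed to rdd.map.
def to_pair(x):
    # x[0] is the key; x[1:] is a tuple of the remaining fields,
    # and list() converts that slice into the desired list value.
    return (x[0], list(x[1:]))

rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]
pairs = [to_pair(r) for r in rows]
print(pairs)
# [('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
```

The resulting pairs have exactly the `(key, value)` shape that PySpark's pair-RDD operations such as `reduceByKey` or `groupByKey` expect.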