nik*_*kos · Tags: key-value, apache-spark, rdd, pyspark
I have an RDD whose tuples have the following form:
[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...
What I want is to transform it into a key-value RDD, where the first field (the first string) becomes the key and the second field is a list of the remaining strings. That is, I want to turn it into the following form:
[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
Explanation of `lambda x: (x[0], list(x[1:]))`:

- `x[0]` makes the first element of the tuple the first element of the output pair (the key).
- `x[1:]` puts every element except the first into the second element of the pair.
- `list(x[1:])` forces that second element to be a list, since slicing a tuple yields a tuple by default.
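The slicing behavior can be checked in plain Python, without a Spark cluster, since `map` just applies the function to each tuple. A small sketch (the `to_pair` name is only for illustration):

```python
# The same function passed to rdd.map, applied to a single tuple.
to_pair = lambda x: (x[0], list(x[1:]))

t = ("a1", "b1", "c1", "d1", "e1")

# Slicing a tuple yields a tuple, not a list:
print(t[1:])       # ('b1', 'c1', 'd1', 'e1')

# list(...) converts the slice, giving the desired (key, list-of-values) pair:
print(to_pair(t))  # ('a1', ['b1', 'c1', 'd1', 'e1'])
```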