pyspark RDD: expand a row into multiple rows

Fra*_*nch 2 python apache-spark rdd pyspark

I have the following RDD in pyspark. I believe this should be simple, but I can't figure it out:

information = [ (10, 'sentence number one'),
                (17, 'longer sentence number two') ]

rdd = sc.parallelize(information)

I need to apply a transformation that turns the RDD into:

[ ('sentence', 10),
  ('number', 10),
  ('one', 10),
  ('longer', 17),
  ('sentence', 17),
  ('number', 17),
  ('two', 17) ]

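Essentially, each sentence should be expanded into multiple rows, one per word, with the word as the key and the original number as the value.

I would like to avoid using SQL.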

Psi*_*dom 5

Use flatMap:

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])

Example:

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
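To see why flatMap rather than map produces the flat result, here is a minimal plain-Python sketch that imitates the two steps on a local list, without a SparkContext. The helper name `expand_row` is hypothetical (it is not part of the original answer), and `itertools.chain.from_iterable` only stands in for the flattening that flatMap performs; this is an illustration, not how Spark implements it.

```python
from itertools import chain

def expand_row(row):
    """Turn one (number, sentence) pair into a list of (word, number) pairs.

    Hypothetical helper: same logic as the lambda in the answer, just named.
    """
    number, sentence = row
    return [(word, number) for word in sentence.split()]

information = [(10, 'sentence number one'),
               (17, 'longer sentence number two')]

# map-like step: one list per input row, so the result is nested
nested = [expand_row(row) for row in information]

# flatMap-like step: the per-row lists are concatenated into one flat list
flat = list(chain.from_iterable(nested))

print(flat)
```

With Spark available, the same function could be passed directly, as in `rdd.flatMap(expand_row)`, since flatMap accepts any function that returns an iterable for each element.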