PySpark groupByKey returning pyspark.resultiterable.ResultIterable

the*_*ing · 46 votes · Tags: python, apache-spark, pyspark

I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>), (1, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a4d0>), (2, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a390>), (3, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a290>), (4, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a450>), (5, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a350>), (6, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a1d0>), (7, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a490>), (8, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a050>), (9, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a650>)]

My flatMapped values look like this:

[(0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D')]

And I am simply doing:

groupRDD = columnRDD.groupByKey()

dpe*_*ock · 66 votes

What you are getting back is an object that lets you iterate over the results. You can turn the result of groupByKey into a list by calling list() on the values, e.g.

example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])

example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]

example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]

  • `example.groupByKey().mapValues(list).collect()` is shorter and also works (31 upvotes)
  • How do you map over the `ResultIterable` type? (3 upvotes)

Jay*_*ram · 24 votes

You can also use

example.groupByKey().mapValues(list)