Convert a Row into a List of Strings in PySpark

use*_*570 2 apache-spark pyspark pyspark-sql

I have data in Row tuple format -

Row(Sentence=u'When, for the first time I realized the meaning of death.')

I want to convert it into string format like this -

(u'When, for the first time I realized the meaning of death.')

I tried this (assuming "a" holds the data in a Row tuple) -

b = sc.parallelize(a)
b = b.map(lambda line: tuple([str(x) for x in line]))
print(b.take(4))

But the result I get looks like this -

[('W', 'h', 'e', 'n', ',', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'f', 'i', 'r', 's', 't', ' ', 't', 'i', 'm', 'e', ' ', 'I', ' ', 'r', 'e', 'a', 'l', 'i', 'z', 'e', 'd', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'a', 'n', 'i', 'n', 'g', ' ', 'o', 'f', ' ', 'd', 'e', 'a', 't', 'h', '.')]

Does anyone know what I am doing wrong here?

hi-*_*zir 5

With a single Row (why would you even...) it should be:

a = Row(Sentence=u'When, for the first time I realized the meaning of death.')

b = sc.parallelize([a])

and flatten with

b.map(lambda x: x.Sentence)

or

b.flatMap(lambda x: x)

That said, sc.parallelize(a) is already in the format you need - because you pass an Iterable, Spark iterates over all the fields of the Row to create the RDD.
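The character-tuple output above can be reproduced without a cluster: pyspark.sql.Row is a tuple subclass, so iterating a single Row yields its fields, and the question's map then iterates the resulting string character by character. A minimal pure-Python sketch (using collections.namedtuple as a stand-in for Row, which behaves the same way for iteration and attribute access):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass
Row = namedtuple("Row", ["Sentence"])
a = Row(Sentence="When, for the first time I realized the meaning of death.")

# Iterating the Row itself yields its fields (one string here),
# which is what sc.parallelize(a) distributes: strings, not Rows.
fields = list(a)
print(fields[0])  # When, for the first time I realized the meaning of death.

# Mapping tuple(str(x) for x in line) over a string then iterates
# its characters - reproducing the character tuple from the question.
exploded = tuple(str(x) for x in fields[0])
print(exploded[:4])  # ('W', 'h', 'e', 'n')
```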


小智 5

Below is the code:

col = 'your_column_name'
# collect() returns a list of Row objects for the selected column
val = df.select(col).collect()
# extract the column value from each Row (getattr is the idiomatic form)
val2 = [getattr(ele, col) for ele in val]
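The extraction step after collect() can be sanity-checked locally; here a namedtuple stands in for the Row objects that df.select(col).collect() would return, since pyspark.sql.Row supports the same attribute access (the column name and values are placeholders):

```python
from collections import namedtuple

col = "Sentence"
# Stand-in for the list of Rows returned by df.select(col).collect()
Row = namedtuple("Row", [col])
val = [Row("first sentence"), Row("second sentence")]

# getattr(row, col) looks up the column dynamically by name
val2 = [getattr(ele, col) for ele in val]
print(val2)  # ['first sentence', 'second sentence']
```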