How can I temporarily store a column as a JSON object in order to derive other columns from it?

For*_*sed 5 apache-spark

I have a dataset of key-value pairs like this:

likes=dogs;hates=birds;likes=sports;eats=cheese

I then turned it into JSON:

{"likes": ["dogs","sports"], "hates": ["birds"], "eats": ["cheese"]}
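For reference, the grouping step described above (collecting repeated keys into lists) can be sketched in plain Java. This is only an illustration under assumptions: `InterestParser` and `parse` are hypothetical names, not part of Spark or of the original code, and the sketch assumes pairs are separated by `;` and keys from values by the first `=`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: groups "key=value" pairs by key.
class InterestParser {

    // Parses "k1=v1;k2=v2;..." into key -> list of values, keeping first-seen key order.
    static Map<String, List<String>> parse(String line) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String pair : line.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length < 2) {
                continue; // skip malformed or empty entries
            }
            grouped.computeIfAbsent(kv[0].trim(), k -> new ArrayList<>()).add(kv[1].trim());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, List<String>> interests =
            parse("likes=dogs;hates=birds;likes=sports;eats=cheese");
        System.out.println(interests);
        // prints {likes=[dogs, sports], hates=[birds], eats=[cheese]}
    }
}
```

The result is a structured value (a map of lists) rather than a string, which is the shape I would like to keep around while deriving the other columns.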

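Is there a way to keep this JSON data structure without converting it to a string, so that I can derive more columns from it row by row? I would like it to look something like this, without having to decode the JSON from a string every time I add a column.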

        Dataset<Row> df1 = df.withColumn("interests", callUDF("to_json", col("interests")))
                         .withColumn("likes", callUDF("extract_from_json", lit("likes"), col("interests")))
                         .withColumn("hates", callUDF("extract_from_json", lit("hates"), col("interests")))
                         .withColumn("eats", callUDF("extract_from_json", lit("eats"), col("interests")));

ayp*_*lam 3

If you are working with the raw file

likes=dogs;hates=birds;likes=sports;eats=cheese

then you can read it in with sc.textFile and apply a few simple RDD operations.

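import org.apache.spark.sql.functions.{collect_list, lit}
import spark.implicits._ // for toDF; assumes a SparkSession named spark

// Split each line into key=value pairs, then each pair into an (interest, value) tuple
val df = sc.textFile(file)
  .flatMap(x => x.split(";"))
  .map(x => (x.split("=")(0), x.split("=")(1)))
  .toDF("interest","value")

// The constant tmp column puts every row in one group so pivot can spread the keys into columns
df.withColumn("tmp",lit(1)).groupBy("tmp").pivot("interest").agg(collect_list("value")).show()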

+---+--------+-------+--------------+
|tmp|    eats|  hates|         likes|
+---+--------+-------+--------------+
|  1|[cheese]|[birds]|[dogs, sports]|
+---+--------+-------+--------------+