I have a dataset of key-value pairs like this:
likes=dogs;hates=birds;likes=sports;eats=cheese
I then turned it into JSON:
{"likes": ["dogs","sports"], "hates": ["birds"], "eats": ["cheese"]}
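Is there a way to keep this JSON data structure without converting it to a string, so that I can derive more columns from it row by row? I would like it to look like this, without having to decode the JSON from a string every time I add a column: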
Dataset<Row> df1 = df.withColumn("interests", callUDF("to_json", col("interests")))
    .withColumn("likes", callUDF("extract_from_json", lit("likes"), col("interests")))
    .withColumn("hates", callUDF("extract_from_json", lit("hates"), col("interests")))
    .withColumn("eats", callUDF("extract_from_json", lit("eats"), col("interests")));
If you are working from the raw file
likes=dogs;hates=birds;likes=sports;eats=cheese
then you can read it in with sc.textFile and apply a few simple RDD operations:
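import org.apache.spark.sql.functions.{collect_list, lit}
import spark.implicits._  // enables .toDF on an RDD; assumes a SparkSession named spark
val df = sc.textFile(file)
  .flatMap(_.split(";"))         // one "key=value" pair per element
  .map(_.split("=", 2))          // split each pair once into key and value
  .map(kv => (kv(0), kv(1)))
  .toDF("interest", "value")
// The constant "tmp" column puts every row in one group; pivot makes one column per interest.
df.withColumn("tmp", lit(1)).groupBy("tmp").pivot("interest").agg(collect_list("value")).show()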
+---+--------+-------+--------------+
|tmp| eats| hates| likes|
+---+--------+-------+--------------+
| 1|[cheese]|[birds]|[dogs, sports]|
+---+--------+-------+--------------+
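To check the split-and-group logic without a Spark session, the same steps can be sketched in plain Java. This is only an illustration of what the pipeline above does to a single record, not Spark code: the class name `InterestParser` and the method `parse` are hypothetical, and skipping malformed pairs is an assumption rather than something the original handles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InterestParser {
    // Hypothetical helper: groups the values of a "k=v;k=v" record by key,
    // mirroring split(";"), split("="), and collect_list per key.
    static Map<String, List<String>> parse(String record) {
        // LinkedHashMap keeps keys in the order they first appear.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String pair : record.split(";")) {
            // Limit 2 so a value that itself contains '=' stays intact.
            String[] kv = pair.split("=", 2);
            if (kv.length < 2 || kv[0].isEmpty()) {
                continue; // assumption: silently skip malformed or empty pairs
            }
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, List<String>> interests =
            parse("likes=dogs;hates=birds;likes=sports;eats=cheese");
        // prints {likes=[dogs, sports], hates=[birds], eats=[cheese]}
        System.out.println(interests);
    }
}
```

The grouped map has the same shape as the pivoted row: one entry per interest, each holding the list of its values.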