How to split a pipe-separated column into multiple rows?

Lec*_*ico 13 apache-spark apache-spark-sql

I have a DataFrame with the following content:

movieId / movieName / genre
1         example1    action|thriller|romance
2         example2    fantastic|action

I want to derive a second DataFrame from the first one, with the following content:

movieId / movieName / genre
1         example1    action
1         example1    thriller
1         example1    romance
2         example2    fantastic
2         example2    action

How can I do that?

Jac*_*ski 25

I'd use the split standard function together with explode.

scala> movies.show(truncate = false)
+-------+---------+-----------------------+
|movieId|movieName|genre                  |
+-------+---------+-----------------------+
|1      |example1 |action|thriller|romance|
|2      |example2 |fantastic|action       |
+-------+---------+-----------------------+

scala> movies.withColumn("genre", explode(split($"genre", "[|]"))).show
+-------+---------+---------+
|movieId|movieName|    genre|
+-------+---------+---------+
|      1| example1|   action|
|      1| example1| thriller|
|      1| example1|  romance|
|      2| example2|fantastic|
|      2| example2|   action|
+-------+---------+---------+

// You can use \\| for split instead
scala> movies.withColumn("genre", explode(split($"genre", "\\|"))).show
+-------+---------+---------+
|movieId|movieName|    genre|
+-------+---------+---------+
|      1| example1|   action|
|      1| example1| thriller|
|      1| example1|  romance|
|      2| example2|fantastic|
|      2| example2|   action|
+-------+---------+---------+
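Why "[|]" or "\\|" rather than a bare "|": the second argument of split is treated as a Java regular expression, and an unescaped | means alternation there. The sketch below illustrates the difference using plain String.split as a stand-in for the Spark function (an assumption made so it runs without a Spark session). The object name PipeSplitDemo and the Pattern.quote variant are illustrative and not part of the original answer.

```scala
import java.util.regex.Pattern

// Hypothetical demo object; uses String.split as a stand-in for Spark's split,
// since both interpret the delimiter as a Java regular expression.
object PipeSplitDemo {
  val genres = "action|thriller|romance"

  // Bare "|" is an alternation of two empty patterns, so it matches
  // empty positions instead of the literal pipe character.
  val naive: Array[String] = genres.split("|")

  // Three ways to match a literal pipe:
  val viaCharClass: Array[String] = genres.split("[|]")             // character class
  val viaEscape: Array[String]    = genres.split("\\|")             // backslash escape
  val viaQuote: Array[String]     = genres.split(Pattern.quote("|")) // quoted literal

  def main(args: Array[String]): Unit = {
    println(s"naive split produced ${naive.length} pieces")
    println(viaCharClass.mkString(", "))
    println(viaEscape.mkString(", "))
    println(viaQuote.mkString(", "))
  }
}
```

With the bare "|" the string is not cut into the three genres, which is why the answer uses a character class or an escape.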

P.S. You can also use Dataset.flatMap to achieve the same result, which is something Scala developers would probably enjoy more.
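A minimal sketch of the Dataset.flatMap variant, assuming the rows are modelled as a typed Dataset. The case class Movie, the object ExplodeGenres, and the helper explodeGenres are hypothetical names introduced for illustration. The local SparkSession setup is likewise an assumption; in spark-shell the spark value already exists.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical case class mirroring the columns from the question.
case class Movie(movieId: Int, movieName: String, genre: String)

object ExplodeGenres {
  // Emits one Movie per genre by splitting the pipe-separated genre field.
  def explodeGenres(movies: Dataset[Movie]): Dataset[Movie] =
    movies.flatMap { m =>
      m.genre.split("[|]").map(g => m.copy(genre = g)).toSeq
    }(Encoders.product[Movie])

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-genres")
      .getOrCreate()
    import spark.implicits._

    val movies = Seq(
      Movie(1, "example1", "action|thriller|romance"),
      Movie(2, "example2", "fantastic|action")
    ).toDS()

    explodeGenres(movies).show()
    spark.stop()
  }
}
```

Compared with withColumn plus explode, this keeps the rows typed as Movie, at the cost of deserializing each row into a JVM object, so the column-based version is probably the better default for plain DataFrames.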