How to use the split function with a delimiter that varies per row?

vis*_*raj 5 apache-spark apache-spark-sql

Input DF:
+-------------------+---------+
|VALUES             |Delimiter|
+-------------------+---------+
|50000.0#0#0#       |#        |
|0@1000.0@          |@        |
|1$                 |$        |
|1000.00^Test_string|^        |
+-------------------+---------+

Expected Output DF:
+-------------------+---------+----------------------+
|VALUES             |Delimiter|SPLITED_VALUES        |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

  

Code:

import org.apache.spark.sql.functions.{col, split}
import sparkSession.sqlContext.implicits._
val dept = Seq(("50000.0#0#0#", "#"),("0@1000.0@", "@"),("1$", "$"),("1000.00^Test_string", "^")).toDF("VALUES", "Delimiter")

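I am very new to Spark and am trying to split the values of the "VALUES" column using the delimiter stored in another column.

I tried the Spark split function like this: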

val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), "#"))

But here the split function takes the delimiter as a constant value, and I cannot pass it as a column like this:

val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), col("Delimiter")))

Can anyone suggest a better approach for this?

Sri*_*vas 5

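Check the code below. It prefixes each delimiter with a backslash so the character is matched literally in the regex, then calls split through expr so the pattern can come from a column.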

scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("split_values",expr("split(values,delimiter)"))
.show(false)
+-------------------+---------+----------------------+
|values             |delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |\#       |[50000.0, 0, 0, ]     |
|0@1000.0@          |\@       |[0, 1000.0, ]         |
|1$                 |\$       |[1, ]                 |
|1000.00^Test_string|\^       |[1000.00, Test_string]|
+-------------------+---------+----------------------+
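Some background on why the backslash is needed: the split pattern is a Java regular expression, and `$` and `^` are anchors there rather than ordinary characters. Here is a minimal sketch in plain Scala (not Spark itself) using `String.split` with limit -1, on the assumption that this mirrors the behavior in the output above, including the trailing empty string. `RegexDelimiterDemo` and `splitRegex` are just illustrative names.

```scala
object RegexDelimiterDemo {
  // Split on a regex and keep trailing empty strings (limit = -1).
  def splitRegex(value: String, regex: String): Seq[String] =
    value.split(regex, -1).toSeq

  def main(args: Array[String]): Unit = {
    // Unescaped "$" is an end-of-input anchor, so the literal "$" is never consumed.
    println(splitRegex("1$", "$"))
    // Escaped "\$" matches the literal character; a trailing "" remains.
    println(splitRegex("1$", "\\$"))
    // Escaped "\^" matches the literal caret.
    println(splitRegex("1000.00^Test_string", "\\^"))
  }
}
```

Note that the backslash prefix only works for punctuation delimiters. If a delimiter were a letter or digit, the prefix would change its meaning (for example `\d` matches any digit), so that case would need different handling.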

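Update: to drop the trailing empty strings, wrap the split in array_remove (available since Spark 2.4).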

scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("data",expr("array_remove(split(trim(values),delimiter),'')"))
.show(false)

+-------------------+---------+----------------------+
|values             |delimiter|data                  |
+-------------------+---------+----------------------+
|50000.0#0#0#       |\#       |[50000.0, 0, 0]       |
|0@1000.0@          |\@       |[0, 1000.0]           |
|1$                 |\$       |[1]                   |
|1000.00^Test_string|\^       |[1000.00, Test_string]|
+-------------------+---------+----------------------+
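If the delimiter column might contain letters, digits, or multi-character strings, one possible alternative is to quote the delimiter with `java.util.regex.Pattern.quote` instead of prefixing a backslash. The sketch below is plain Scala that reproduces the trim, split, and remove-empty steps for a single row; `LiteralSplit` and `splitOnLiteral` are hypothetical names, and treating a null or empty delimiter as "no split" is my own assumption, not something from the question.

```scala
import java.util.regex.Pattern

object LiteralSplit {
  // Trim the value, split on the delimiter taken literally, drop empty tokens.
  def splitOnLiteral(value: String, delimiter: String): Seq[String] =
    if (value == null) Seq.empty
    else if (delimiter == null || delimiter.isEmpty)
      Seq(value.trim).filter(_.nonEmpty) // assumption: no delimiter means no split
    else
      value.trim
        .split(Pattern.quote(delimiter), -1) // \Q...\E makes every character literal
        .toSeq
        .filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val rows = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
    rows.foreach { case (v, d) => println(splitOnLiteral(v, d)) }
  }
}
```

In Spark this could probably be wrapped as a UDF, e.g. `val splitUdf = udf(LiteralSplit.splitOnLiteral _)` and then `dept.withColumn("SPLITED_VALUES", splitUdf(col("VALUES"), col("Delimiter")))`, where `splitUdf` is again a made-up name. A UDF is opaque to the optimizer, so the built-in `split` through `expr` is likely the better choice when all delimiters are punctuation.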