Kar*_*dol 8 scala user-defined-functions apache-spark
I want to remove from col1 the string that is present in col2:
val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")
Using regexp_replace or translate (ref: Spark functions API):
val res = df.withColumn("sentence_without_label", regexp_replace
(col("sentence") , "(?????)", "" ))
so that res looks like this:
You can simply use regexp_replace:
df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))
Or you can use a simple udf function, as shown below:
val df5 = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")
val replace = udf((data: String , rep : String)=>data.replaceAll(rep, ""))
val res = df5.withColumn("sentence_without_label", replace($"sentence" , $"label"))
res.show()
Output:
+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
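One caveat with the udf above: String.replaceAll interprets its first argument as a regular expression, so a label containing regex metacharacters (for example "C++" or "(") may not be removed as expected. A minimal sketch of a literal-safe variant follows, in plain Scala with no Spark dependency. The object name RemoveLabel and both method names are hypothetical, chosen only for illustration.

```scala
import java.util.regex.Pattern

// Hypothetical helper object; the names are illustrative, not from the original answer.
object RemoveLabel {

  // Literal removal: String.replace treats `label` as plain text, not as a regex.
  // Null or empty inputs are returned unchanged.
  def removeLabel(sentence: String, label: String): String =
    if (sentence == null || label == null || label.isEmpty) sentence
    else sentence.replace(label, "")

  // Regex-based alternative: Pattern.quote escapes the label so that
  // replaceAll matches it literally.
  def removeLabelQuoted(sentence: String, label: String): String =
    if (sentence == null || label == null || label.isEmpty) sentence
    else sentence.replaceAll(Pattern.quote(label), "")
}
```

Assuming the same Spark session as above, either method could be wrapped in place of the original udf, e.g. udf(RemoveLabel.removeLabel _), and applied to $"sentence" and $"label" in the same way.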
If label is just a literal (no regular-expression metacharacters), it is simple:
import org.apache.spark.sql.functions._
df.withColumn("sentence_without_label",
regexp_replace(col("sentence"), col("label"), lit(""))).show(false)
+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
In Spark 1.6 you can do the same with expr:
df.withColumn(
"sentence_without_label",
expr("regexp_replace(sentence, label, '')"))