The column contains the delimiter more than once in a single row, so splitting is not that straightforward.
In this case, only the first occurrence of the delimiter should be considered when splitting.
This is what I am doing at the moment.
Is there a better solution, though?
testdf= spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")],["Animal", "Food"])
testdf.show()
+------+---------------+
|Animal| Food|
+------+---------------+
| Dog|meat,bread,milk|
| Cat| mouse,fish|
+------+---------------+
from pyspark.sql.functions import split, col, expr

testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
.withColumn("Food2", expr("regexp_replace(Food, Food1, '')"))\
.withColumn("Food2", expr("substring(Food2, 2)")).show()
+------+---------------+-----+----------+
|Animal| Food|Food1| Food2|
+------+---------------+-----+----------+
| Dog|meat,bread,milk| meat|bread,milk|
| Cat| mouse,fish|mouse| fish|
+------+---------------+-----+----------+
I would just use string functions here; I see no reason to use regular expressions.
from pyspark.sql import functions as F
testdf\
.withColumn("Food1", F.expr("""substring(Food,1,instr(Food,',')-1)"""))\
.withColumn("Food2", F.expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()
#+------+---------------+-----+----------+
#|Animal| Food|Food1| Food2|
#+------+---------------+-----+----------+
#| Dog|meat,bread,milk| meat|bread,milk|
#| Cat| mouse,fish|mouse| fish|
#+------+---------------+-----+----------+
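One more option, not mentioned in the answer above: on Spark 3.0 and later, `pyspark.sql.functions.split` accepts a `limit` argument, so the "split only at the first comma" behaviour can be expressed without `instr`/`substring` at all. The sketch below shows that variant in a comment (it needs a SparkSession to run), and mirrors the same first-delimiter logic in plain Python via `str.partition` so it can be checked standalone:

```python
# Spark >= 3.0 alternative (assumes the testdf from the question;
# needs a SparkSession, so shown here as a comment only):
#
#   from pyspark.sql import functions as F
#   testdf.select(
#       "Animal", "Food",
#       F.split("Food", ",", 2)[0].alias("Food1"),
#       F.split("Food", ",", 2)[1].alias("Food2"),
#   ).show()
#
# The instr/substring expressions in the answer implement the same
# logic as Python's str.partition: everything before the first
# separator, then everything after it.
def first_split(food: str, sep: str = ",") -> tuple:
    """Split only at the first occurrence of sep."""
    head, _, tail = food.partition(sep)
    return head, tail

print(first_split("meat,bread,milk"))  # ('meat', 'bread,milk')
print(first_split("mouse,fish"))       # ('mouse', 'fish')
```

With `limit=2`, the regex pattern is applied at most once, so the remainder of the string is kept intact in the second element, matching the `Food1`/`Food2` columns in both answers above.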