How do I split a PySpark DataFrame column into two columns (example below)?

pro*_*ray 2 split apache-spark apache-spark-sql pyspark

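The column contains the delimiter several times in a single row, so a plain split is not that straightforward.
When splitting, only the first occurrence of the delimiter should be considered.

This is what I am doing at the moment.

Is there a better solution, though?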

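from pyspark.sql.functions import col, expr, split

# Assumes an existing SparkSession named `spark`.
testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])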

testdf.show()

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+

testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

mur*_*ash 5

I would just use string functions; I don't see a reason to use regular expressions here.

from pyspark.sql import functions as F

testdf\
      .withColumn("Food1", F.expr("""substring(Food,1,instr(Food,',')-1)"""))\
      .withColumn("Food2", F.expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()

#+------+---------------+-----+----------+
#|Animal|           Food|Food1|     Food2|
#+------+---------------+-----+----------+
#|   Dog|meat,bread,milk| meat|bread,milk|
#|   Cat|     mouse,fish|mouse|      fish|
#+------+---------------+-----+----------+