Get the second-to-last word from a string using PySpark

Joh*_*Doe 2 split apache-spark apache-spark-sql pyspark

I need to get the second-to-last word from a string value.

df = spark.createDataFrame([
  ["sample text 1 AFTEDGH XX"],
  ["sample text 2 GDHDH ZZ"],
  ["sample text 3 JEYHEHH YY"],
  ["sample text 4 QPRYRT EB"],
  ["sample text 5 KENBFBF XX"]
]).toDF("line")

Expected output:

+--------+
|word    |
+--------+
|AFTEDGH |
|GDHDH   |
|JEYHEHH |
|QPRYRT  |
|KENBFBF |
+--------+

I tried:

df_new = df.withColumn('word', F.split(F.col('line'), ' ')[-2])

df_new = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[-2])

But both return null.

mck*_*mck 5

To use a negative index, you can use element_at:

import pyspark.sql.functions as F

df2 = df.withColumn('word', F.element_at(F.split(F.col('line'), ' '), -2))

df2.show(truncate=False)
+------------------------+-------+
|line                    |word   |
+------------------------+-------+
|sample text 1 AFTEDGH XX|AFTEDGH|
|sample text 2 GDHDH ZZ  |GDHDH  |
|sample text 3 JEYHEHH YY|JEYHEHH|
|sample text 4 QPRYRT EB |QPRYRT |
|sample text 5 KENBFBF XX|KENBFBF|
+------------------------+-------+

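Your second attempt was almost correct. Because you have already reversed the array, use a positive index, and remember to subtract 1 from it: bracket indexing is zero-based, so the second element of the reversed array is at index 1: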

df2 = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[1])
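
The index arithmetic behind both variants can be checked without a Spark session. The sketch below is plain Python, not Spark code: `element_at_model` is a hypothetical helper that only imitates the 1-based convention of element_at (1 is the first element, -1 the last), and returning None for an out-of-range index is an assumption of this model, not a statement about every Spark configuration.

```python
# Plain-Python model of the two index conventions (this is NOT Spark code).
words = "sample text 1 AFTEDGH XX".split(" ")


def element_at_model(arr, index):
    """Hypothetical helper imitating 1-based indexing as used by element_at.

    1 is the first element, -1 is the last. Index 0 is rejected.
    Returning None when out of range is an assumption of this model.
    """
    if index == 0:
        raise ValueError("index 0 is not valid for 1-based indexing")
    # Map the 1-based (or negative) index onto a 0-based list position.
    pos = index - 1 if index > 0 else len(arr) + index
    return arr[pos] if 0 <= pos < len(arr) else None


# Variant 1: negative 1-based index, counting from the end.
second_to_last = element_at_model(words, -2)

# Variant 2: reverse first, then use a zero-based positive index.
# The second element of the reversed list sits at index 1, not 2.
via_reverse = list(reversed(words))[1]

print(second_to_last, via_reverse)
```

Both variants pick the same word, which is why the reversed version needs index 1 rather than 2.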