How to use PySpark CountVectorizer on a column that may be null

Question

Nic*_*ian 3 apache-spark pyspark apache-spark-mllib

My Spark DataFrame has a column:

 |-- topics_A: array (nullable = true)
 |    |-- element: string (containsNull = true)

I am using CountVectorizer:

from pyspark.ml.feature import CountVectorizer
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")

I get NullPointerExceptions because the topics_A column sometimes contains nulls.

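Is there a way around this? Filling the column with zero-length arrays would work (although it would increase the data size considerably), but I can't figure out how to do a fillna on an Array column in PySpark.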

Answer 1

zer*_*323 6

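Personally I would drop the rows with NULL values, since they carry no useful information, but you can replace the nulls with empty arrays. First, some imports: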

from pyspark.sql.functions import when, col, coalesce, array

You can define an empty array of a specific type as:

fill = array().cast("array<string>")

and combine it with a when clause:

topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))

or with coalesce:

topics_a = coalesce(col("topics_A"), fill)

and use it as:

df.withColumn("topics_A", topics_a)

So with example data:

df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])

df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)

With the nulls replaced by empty arrays, fit and transform run without the NullPointerException.
