PySpark DataFrame: how to drop rows that are null in all columns?

kww*_*kww 9 python apache-spark apache-spark-sql pyspark pyspark-sql

Given a DataFrame, before:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

After, I want it to be:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I'd prefer a general approach, so that it still works when df.columns is very long. Thanks!

use*_*411 16

Pass the strategy you need to na.drop:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

# how="all" drops a row only when every column in it is null
df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh, the minimum number of NOT NULL values a row must contain to be kept:

df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
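To see why how="all" and thresh=1 return the same rows, here is a minimal plain-Python model of the rule (not Spark code; keep_row is a hypothetical helper written for illustration). It assumes a row is kept when its count of non-null values reaches thresh, with how="any" standing for "all values non-null" and how="all" standing for "at least one non-null".

```python
from typing import Optional, Sequence


def keep_row(row: Sequence[object], how: str = "any",
             thresh: Optional[int] = None) -> bool:
    """Hypothetical model of the na.drop row filter; None plays the role of null."""
    non_null = sum(value is not None for value in row)
    if thresh is None:
        # how="any" drops a row with any null, so every value must be non-null;
        # how="all" drops only all-null rows, so one non-null value is enough.
        thresh = len(row) if how == "any" else 1
    return non_null >= thresh


rows = [(1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)]

kept_all = [r for r in rows if keep_row(r, how="all")]
kept_thresh = [r for r in rows if keep_row(r, thresh=1)]
kept_any = [r for r in rows if keep_row(r, how="any")]

print(kept_all == kept_thresh)  # the two formulations agree on this data
print(kept_any)                 # only the fully populated row survives
```

In this model an explicit thresh takes priority over how, so raising it (for example thresh=2) would also drop the (None, "C", None) row.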


Psi*_*dom 6

One option is to use functools.reduce to construct the condition:

from functools import reduce
# AND together isNull() for every column, then negate: keep rows that are not all-null
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

Here reduce builds the following condition:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
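The left-to-right nesting in that expression comes from how reduce folds the list. The sketch below reproduces it without Spark, using a hypothetical Cond class that stands in for a Column and only records the expression text; the class and its formatting are assumptions made for illustration.

```python
from functools import reduce


class Cond:
    """Hypothetical stand-in for a Spark Column; it only stores expression text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __and__(self, other: "Cond") -> "Cond":
        # Mirrors the `&` used on columns: combine two conditions with AND
        return Cond(f"({self.text} AND {other.text})")

    def __invert__(self) -> "Cond":
        # Mirrors the `~` used on columns: negate the condition
        return Cond(f"(NOT {self.text})")


columns = ["ID", "TYPE", "CODE"]
is_null = [Cond(f"({c} IS NULL)") for c in columns]

# reduce folds from the left: ((ID & TYPE) & CODE)
all_null = reduce(lambda x, y: x & y, is_null)
keep = ~all_null

print(keep.text)
# (NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))
```

Because the list is built from the column names, the same fold works unchanged however many columns there are.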