kww*_*kww 9 python apache-spark apache-spark-sql pyspark pyspark-sql
For a DataFrame, before:
+----+----+----+
| ID|TYPE|CODE|
+----+----+----+
| 1| B| X1|
|null|null|null|
|null| B| X1|
+----+----+----+
And after, I want it to be:
+----+----+----+
| ID|TYPE|CODE|
+----+----+----+
| 1| B| X1|
|null| B| X1|
+----+----+----+
I would prefer a general approach, so that it can be applied when df.columns is very long. Thanks!
use*_*411 16
Passing the strategy you need to na.drop is enough:
df = spark.createDataFrame([
(1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
("ID", "TYPE", "CODE")
)
df.na.drop(how="all").show()
+----+----+----+
| ID|TYPE|CODE|
+----+----+----+
| 1| B| X1|
|null| B| X1|
|null| C|null|
+----+----+----+
An alternative formulation uses the threshold, thresh (the minimum number of NOT NULL values a row must have to be kept):
df.na.drop(thresh=1).show()
+----+----+----+
| ID|TYPE|CODE|
+----+----+----+
| 1| B| X1|
|null| B| X1|
|null| C|null|
+----+----+----+
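To make the thresh rule concrete, here is a minimal plain-Python sketch of the same row-dropping logic. It is only an analogue for illustration, not how Spark implements it; the helper name drop_rows is hypothetical, and rows are modelled as tuples with None standing in for null.

```python
# Plain-Python analogue of the thresh rule; this is not Spark code.
# `drop_rows` is a hypothetical helper, and None stands in for null.
rows = [
    (1, "B", "X1"),
    (None, None, None),
    (None, "B", "X1"),
    (None, "C", None),
]


def drop_rows(rows, thresh):
    """Keep only the rows that have at least `thresh` non-None values."""
    return [r for r in rows if sum(v is not None for v in r) >= thresh]


# thresh=1 keeps every row with at least one value, so only the
# all-None row is dropped, the same effect as how="all".
kept = drop_rows(rows, thresh=1)
print(kept)
```

Raising thresh makes the filter stricter: with thresh=2 the (None, "C", None) row would be dropped as well.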
One option is to use functools.reduce to construct the condition:
from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
| ID|TYPE|CODE|
+----+----+----+
| 1| B| X1|
|null| B| X1|
+----+----+----+
where reduce produces a query like this:
~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
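The folding step can be tried without a Spark session. The sketch below is a plain-Python analogue, assuming rows are tuples with None in place of null; the helper name all_null is hypothetical. It uses reduce the same way, but combines ordinary booleans with `and` where the Spark version combines Column expressions with `&`.

```python
from functools import reduce

# Plain-Python analogue; None stands in for null and `all_null` is a
# hypothetical helper, not part of any Spark API.
rows = [(1, "B", "X1"), (None, None, None), (None, "B", "X1")]


def all_null(row):
    """Fold the per-value null checks into a single boolean."""
    # Mirrors reduce(lambda x, y: x & y, [...isNull()...]) from above,
    # with `and` on booleans in place of `&` on Column expressions.
    return reduce(lambda x, y: x and y, [v is None for v in row])


# Negating the folded condition keeps rows with at least one value.
kept = [r for r in rows if not all_null(r)]
print(kept)
```

Because the condition is built from whatever the column list contains, the same pattern works however many columns there are.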