I have a dataframe in PySpark with more than 300 columns. In some of these columns the values are null.
For example:
Column_1   column_2
null       null
null       null
234        null
125        124
365        187
and so on
When I do a sum of column_1, I get null as the result instead of 724.
Now I want to replace the nulls in all columns of the dataframe with empty space, so that when I try to sum those columns I don't get a null value but a numerical value.
How can we achieve that in PySpark?
Mar*_*usz 49
You can replace null values with zeros using df.na.fill, e.g.:
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df.na.fill(0).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 0|
+---+
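A minimal sketch of the same idea applied to the question's layout (the column names Column_1/column_2 and the numbers are copied from the example above; the explicit schema is only there to keep the toy frame unambiguous):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('Column_1', LongType(), True),
    StructField('column_2', LongType(), True),
])
df = spark.createDataFrame(
    [(None, None), (None, None), (234, None), (125, 124), (365, 187)],
    schema,
)

# na.fill(0) replaces nulls in every numeric column in one call,
# so the same line works unchanged on a 300+ column dataframe.
filled = df.na.fill(0)

filled.agg(F.sum('Column_1')).show()
# +-------------+
# |sum(Column_1)|
# +-------------+
# |          724|
# +-------------+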
Dug*_*jay 31
You can use the fillna() function.
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df = df.fillna({'col':'4'})
>>> df.show()
or df.fillna({'col':'4'}).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
There are 3 options when using fillna...
Documentation:
def fillna(self, value, subset=None):
    """Replace null values, alias for ``na.fill()``.

    :func:`DataFrame.fillna` and :func:`DataFrameNaFunctions.fill` are
    aliases of each other.

    :param value: int, long, float, string, bool or dict.
        Value to replace null values with.
        If the value is a dict, then `subset` is ignored and `value` must
        be a mapping from column name (string) to replacement value.
        The replacement value must be an int, long, float, boolean, or string.
    :param subset: optional list of column names to consider.
        Columns specified in subset that do not have matching data type
        are ignored. For example, if `value` is a string, and subset
        contains a non-string column, then the non-string column is
        simply ignored.
So you can do:

df.fillna(value)
df.fillna(dict_of_col_to_value)
df.fillna(value, subset=list_of_cols)

fillna() is an alias for na.fill(), so they are the same.
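A short sketch showing all three call patterns on a tiny two-column frame (the column names num/txt are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, None), (None, 'x')], ['num', 'txt'])

# 1) Single value: fills every column whose type matches the value,
#    so the string column 'txt' is left untouched here.
df.fillna(0).show()

# 2) Dict of column -> value: per-column replacements; subset is ignored.
df.fillna({'num': 0, 'txt': ''}).show()

# 3) Value plus subset: restrict the fill to the listed columns.
df.fillna(0, subset=['num']).show()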