I have a dataframe in PySpark with more than 300 columns. In some of these columns the values are null.
For example:
Column_1   column_2
null       null
null       null
234        null
125        124
365        187
and so on
When I do a sum of column_1, I get null as the result instead of 724.
Now I want to replace the nulls in all columns of the dataframe with empty space, so that when I try to sum those columns I don't get a null value but a numerical value.
How can we achieve that in PySpark?
Mar*_*usz 49
You can replace null values with zeros using df.na.fill, e.g.:
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df.na.fill(0).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 0|
+---+
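A minimal sketch of the same idea applied to the question's layout (the column names Column_1/column_2 and the numbers are copied from the example above; the explicit schema is only there to keep the toy frame unambiguous):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('Column_1', LongType(), True),
    StructField('column_2', LongType(), True),
])
df = spark.createDataFrame(
    [(None, None), (None, None), (234, None), (125, 124), (365, 187)],
    schema,
)

# na.fill(0) replaces nulls in every numeric column in one call,
# so the same line works unchanged on a 300+ column dataframe.
filled = df.na.fill(0)

filled.agg(F.sum('Column_1')).show()
# +-------------+
# |sum(Column_1)|
# +-------------+
# |          724|
# +-------------+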
Dug*_*jay 31
You can use the fillna() function.
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df = df.fillna({'col':'4'})
>>> df.show()
or df.fillna({'col':'4'}).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
There are 3 options when using fillna...
Documentation:
def fillna(self, value, subset=None):
    """Replace null values, alias for ``na.fill()``.

    :func:`DataFrame.fillna` and :func:`DataFrameNaFunctions.fill` are
    aliases of each other.

    :param value: int, long, float, string, bool or dict.
        Value to replace null values with.
        If the value is a dict, then `subset` is ignored and `value` must
        be a mapping from column name (string) to replacement value.
        The replacement value must be an int, long, float, boolean, or string.
    :param subset: optional list of column names to consider.
        Columns specified in subset that do not have matching data type
        are ignored. For example, if `value` is a string, and subset
        contains a non-string column, then the non-string column is
        simply ignored.
So you can do:

df.fillna(value)
df.fillna(dict_of_col_to_value)
df.fillna(value, subset=list_of_cols)

fillna() is an alias for na.fill(), so they are the same.
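A short sketch showing all three call patterns on a tiny two-column frame (the column names num/txt are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, None), (None, 'x')], ['num', 'txt'])

# 1) Single value: fills every column whose type matches the value,
#    so the string column 'txt' is left untouched here.
df.fillna(0).show()

# 2) Dict of column -> value: per-column replacements; subset is ignored.
df.fillna({'num': 0, 'txt': ''}).show()

# 3) Value plus subset: restrict the fill to the listed columns.
df.fillna(0, subset=['num']).show()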