Fill null values in a PySpark DataFrame based on column data type

Question

Suppose I have a sample DataFrame like this:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
| null|  30|null|
|mouse|null|15.3|
+-----+----+----+

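I want to fill the null values according to each column's data type. For example, string columns should get "N/A", integer columns should get 0, and float columns should get 0.0.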

I tried df.fillna(), but then realized the DataFrame could have any number of columns, so I would like a dynamic solution.

Answer 1

Sur*_*ali 5

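df.dtypes gives you a list of (column_name, data_type) tuples. You can use it to collect the names of the string, integer, and float columns in df, then call fillna() once per group, passing those names as subset.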

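from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Sample data; col2 is inferred as bigint and col3 as double
df = sc.parallelize([['cat', 10, 1.5], ['dog', 20, 9.0],
                     [None, 30, None], ['mouse', None, 15.3]])\
       .toDF(['col1', 'col2', 'col3'])

# Group column names by the type string reported in df.dtypes
string_col = [name for name, dtype in df.dtypes if dtype.startswith('string')]
big_int_col = [name for name, dtype in df.dtypes if dtype.startswith('bigint')]
double_col = [name for name, dtype in df.dtypes if dtype.startswith('double')]

# Fill each group with the default value for its type
df.fillna('N/A', subset=string_col)\
  .fillna(0, subset=big_int_col)\
  .fillna(0.0, subset=double_col)\
  .show()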

Output:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
|  N/A|  30| 0.0|
|mouse|   0|15.3|
+-----+----+----+
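
The same idea can also be expressed as a single column-to-value mapping instead of one list per type. The following is only a minimal sketch: `DEFAULTS` and `build_fill_map` are hypothetical names introduced here, not part of PySpark, and the choice of which type names to cover is an assumption. The function takes a list shaped like df.dtypes, so it can be tried without a running Spark session.

```python
# Hypothetical defaults, keyed by the Spark SQL type name that df.dtypes reports.
DEFAULTS = {
    "string": "N/A",
    "tinyint": 0, "smallint": 0, "int": 0, "bigint": 0,
    "float": 0.0, "double": 0.0,
}


def build_fill_map(dtypes, defaults=DEFAULTS):
    """Map each column name to the default for its type.

    `dtypes` is a list of (column_name, data_type) tuples, as returned by
    df.dtypes. Columns whose type is not in `defaults` are skipped.
    """
    return {name: defaults[dtype] for name, dtype in dtypes if dtype in defaults}


# Same shape as df.dtypes for the example DataFrame above.
example_dtypes = [("col1", "string"), ("col2", "bigint"), ("col3", "double")]
fill_map = build_fill_map(example_dtypes)
print(fill_map)
```

With a live DataFrame this would presumably be used as df.fillna(build_fill_map(df.dtypes)). When the value passed to fillna() is a dict, subset is ignored and the dict is read as a mapping from column name to replacement value. Columns of any type not listed in the defaults (dates, decimals, arrays, and so on) are left untouched.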
