Drop a column from a Spark DataFrame if all of its entries are null

Question

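Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently, drop every column that contains no data?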

Edit: as requested by Suresh,

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

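Here I am assuming that if the distinct count is 1, the column must be all NaN. But I would like to check that the single value really is NaN. If there is a built-in Spark function for this, please let me know.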

Answer 1

Sur*_*esh 7

I gave it a try. Say I have a DataFrame like the one below,

>>> from pyspark.sql import functions as F

>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+

>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+
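
The session above calls `first()` once per column when building `nonNull_cols`, which may start a separate Spark job for each column. As a rough sketch (not part of the original answer), the same idea can be wrapped in a function that collects the aggregated row a single time. The names `keep_columns` and `drop_all_null_columns` are made up for illustration, and the code assumes column names without dots or duplicates.

```python
def keep_columns(columns, non_null_counts):
    """Return, in their original order, the columns with at least one non-null value."""
    return [c for c in columns if non_null_counts[c] > 0]


def drop_all_null_columns(df):
    """Drop every column of a Spark DataFrame that holds only nulls (hypothetical helper)."""
    # Imported lazily so keep_columns can be used without PySpark installed.
    from pyspark.sql import functions as F

    if not df.columns:
        return df
    # F.count(c) counts non-null values only; one aggregation covers all columns.
    row = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()
    # Collect the single result row once and decide on the driver.
    return df.select(*keep_columns(df.columns, row.asDict()))
```

With the counts shown above, `keep_columns(["col1", "col2", "col3"], {"col1": 2, "col2": 2, "col3": 0})` returns `["col1", "col2"]`. Note that `F.count` treats NaN as a regular value, so a column that is entirely NaN would be kept by this sketch.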

Answer 2

小智 6

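Here is a more efficient solution that does not loop over the columns. It is much faster when you have many columns. I tested the other approaches on a DataFrame with 800 columns and they took 17 minutes to run. On the same dataset, the approach below took only 1 minute in my tests.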

from pyspark.sql import functions as F

def drop_fully_null_columns(df, but_keep_these=()):
    """Drops DataFrame columns that are fully null
    (i.e. the maximum value is null)

    Arguments:
        df {spark DataFrame} -- spark dataframe
        but_keep_these {list} -- list of columns to keep without checking for nulls

    Returns:
        spark DataFrame -- dataframe with fully null columns removed
    """

    # skip checking some columns
    cols_to_check = [col for col in df.columns if col not in but_keep_these]
    if len(cols_to_check) > 0:
        # drop columns for which the max is None
        rows_with_data = df.select(*cols_to_check).groupby().agg(*[F.max(c).alias(c) for c in cols_to_check]).take(1)[0]
        cols_to_drop = [c for c, const in rows_with_data.asDict().items() if const is None]
        new_df = df.drop(*cols_to_drop)

        return new_df
    else:
        return df

Answer 3

Pin*_*ntu 2

One indirect way of doing this is

import pyspark.sql.functions as func

for col in sdf.columns:
    # Drop the column when the number of NaN rows equals the total row count.
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.count():
        sdf = sdf.drop(col)

Update:
The code above drops columns that are entirely NaN. If you are looking for columns that are entirely null, then

import pyspark.sql.functions as func

for col in sdf.columns:
    # Drop the column when the number of null rows equals the total row count.
    if sdf.filter(func.col(col).isNull()).count() == sdf.count():
        sdf = sdf.drop(col)
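
The two loops above run separate Spark jobs for every column and treat NaN and null as separate cases. A possible single-pass variant, offered only as a sketch, counts the values per column that are neither null nor (for float/double columns) NaN, then drops the columns whose count is zero. The names `usable_columns` and `drop_null_or_nan_columns` are hypothetical, and the sketch assumes column names without dots or duplicates.

```python
def usable_columns(columns, usable_counts):
    """Return, in their original order, the columns with at least one usable value."""
    return [c for c in columns if usable_counts[c] > 0]


def drop_null_or_nan_columns(sdf):
    """Drop columns that contain only nulls and/or NaNs (hypothetical helper)."""
    # Imported lazily so usable_columns can be used without PySpark installed.
    import pyspark.sql.functions as func

    if not sdf.columns:
        return sdf
    exprs = []
    for name, dtype in sdf.dtypes:
        usable = func.col(name).isNotNull()
        # Only float/double columns can hold NaN, so isnan is applied to those alone.
        if dtype in ("float", "double"):
            usable = usable & ~func.isnan(func.col(name))
        # when() without otherwise() yields null for unusable rows, which count() skips.
        exprs.append(func.count(func.when(usable, 1)).alias(name))
    counts = sdf.agg(*exprs).first().asDict()
    return sdf.select(*usable_columns(sdf.columns, counts))
```

Calling `sdf = drop_null_or_nan_columns(sdf)` would replace both loops. Because all counts come from one aggregation, the DataFrame is scanned once regardless of how many columns it has.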

If I find a better approach, I will update my answer :-)
