PySpark dataframe: filter on, or based on, list containment

use*_*475 28 filter apache-spark apache-spark-sql pyspark

I am trying to filter a PySpark dataframe using a list: I want to either filter out records whose values appear in the list, or keep only records whose values appear in the list. My code does not work:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

use*_*475 53

As the error suggests, df.score in l cannot be evaluated because df.score gives you a Column, and the in operator is not defined on that column type. Use isin instead.

The code should look like this:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

  • How can I do this with a broadcast variable as the list instead of a regular Python list? When I try, I get a 'Broadcast' object has no attribute '_get_object_id' error. (2 upvotes)

Vzz*_*arr 27

Building on the answer above, the isin() method can also be called on F.col() like this:

import pyspark.sql.functions as F


l = [10,18,20]
df.filter(F.col("score").isin(l))


ban*_*men 13

I found the join implementation to be significantly faster than where for large dataframes:

from pyspark.sql import SparkSession

def filter_spark_dataframe_by_list(df, column_name, filter_list):
    """Returns the subset of df where df[column_name] is in filter_list."""
    spark = SparkSession.builder.getOrCreate()
    # createDataFrame from a list of scalars yields a single column named "value"
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    return df.join(filter_df, df[column_name] == filter_df["value"])