Filter on whether a column's value equals a list in Spark

Luk*_*uke 8 python apache-spark apache-spark-sql pyspark

I'm trying to filter a Spark DataFrame based on whether the values in a column are equal to a list. I'd like to do something like this:

filtered_df = df.where(df.a == ['list', 'of', 'stuff'])

where filtered_df contains only the rows in which the value of filtered_df.a is exactly ['list', 'of', 'stuff'], and a has type array (nullable = true).

Ass*_*son 8

You can create a udf. For example:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def test_in(x):
    return x == ['list', 'of', 'stuff']

f = udf(test_in, BooleanType())
filtered_df = df.where(f(df.a))


zer*_*323 7

Update:

With current versions you can use an array literal:

from pyspark.sql.functions import array, lit

df.where(df.a == array(*[lit(x) for x in ['list', 'of', 'stuff']]))
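The *-unpacking turns the comprehension into one lit() argument per element. A local sketch of just the unpacking step, using a plain stand-in function rather than pyspark's array (the stub name is made up for illustration):

```python
def array_stub(*cols):
    # Hypothetical stand-in for pyspark.sql.functions.array:
    # it simply collects its positional arguments.
    return list(cols)

values = ['list', 'of', 'stuff']
# *-unpacking passes one element per positional argument, so the two
# calls below are equivalent:
print(array_stub(*values) == array_stub('list', 'of', 'stuff'))  # True
```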

Original answer:

Well, a slightly hacky way, which doesn't require a Python batch job, is something like this:

from pyspark.sql.functions import col, lit, size
from functools import reduce
from operator import and_

def array_equal(c, an_array):
    same_size = size(c) == len(an_array)  # Check if the same size
    # Check if all items equal
    same_items = reduce(
        and_, 
        (c.getItem(i) == an_array[i] for i in range(len(an_array)))
    )
    return and_(same_size, same_items)
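The reduce(and_, ...) fold above combines the per-position comparisons into a single column expression. The same pattern over plain Python booleans, as a local sketch of the logic (the function name here is illustrative, not part of the answer):

```python
from functools import reduce
from operator import and_

def array_equal_local(items, expected):
    # Same shape as the Spark version: a size check plus a fold of
    # per-position equality checks combined with and_.
    if len(items) != len(expected):
        return False
    return reduce(
        and_,
        (items[i] == expected[i] for i in range(len(expected)))
    )

print(array_equal_local(['list', 'of', 'stuff'], ['list', 'of', 'stuff']))       # True
print(array_equal_local(['a', 'list', 'of', 'stuff'], ['list', 'of', 'stuff']))  # False
```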

A quick test:

df = sc.parallelize([
    (1, ['list', 'of', 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list', 'of', 'stuff', 'and', 'foo']),
    (5, ['a', 'list', 'of', 'stuff']),
]).toDF(['id', 'a'])

df.where(array_equal(col('a'), ['list', 'of', 'stuff'])).show()
## +---+-----------------+
## | id|                a|
## +---+-----------------+
## |  1|[list, of, stuff]|
## +---+-----------------+


Geo*_*los 1

Nowadays you can simply do the following:

from pyspark.sql import functions as F

df.filter(F.col(selected_col).isin(list_with_values))

I find it much simpler, cleaner, and more straightforward.
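One caveat: isin tests whether each scalar value of the column appears in the list, which is different from comparing a whole array-typed column against the list, as the question asks. A local analogue of the per-row check that isin performs (the variable names are made up for illustration):

```python
list_with_values = ['list', 'of', 'stuff']
column_values = ['list', 'foo', 'stuff']  # one scalar value per row

# isin keeps a row when its value is a member of the list,
# i.e. Python's `in` operator applied per row.
kept = [v for v in column_values if v in list_with_values]
print(kept)  # ['list', 'stuff']
```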