PySpark: filtering with isin returns an empty DataFrame

Question

LeP*_*ppy 4 python apache-spark apache-spark-sql pyspark pyspark-sql

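Context: I need to filter a DataFrame with the isin function, based on the contents of a column of another DataFrame.

For Python users working with pandas, that would be isin().
For R users, that would be %in%.

So I have a simple Spark DataFrame with id and value columns: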

l = [(1, 12), (1, 44), (1, 3), (2, 54), (3, 18), (3, 11), (4, 13), (5, 78)]
df = spark.createDataFrame(l, ['id', 'value'])
df.show()

+---+-----+
| id|value|
+---+-----+
|  1|   12|
|  1|   44|
|  1|    3|
|  2|   54|
|  3|   18|
|  3|   11|
|  4|   13|
|  5|   78|
+---+-----+

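I want to get all the IDs that appear more than once. Here is a DataFrame of the IDs that occur only once in df:

from pyspark.sql.functions import col

unique_ids = df.groupBy('id').count().where(col('count') < 2)
unique_ids.show()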

+---+-----+
| id|count|
+---+-----+
|  5|    1|
|  2|    1|
|  4|    1|
+---+-----+

So the logical operation would be:

df = df[~df.id.isin(unique_ids.id)]
# This is the same as:
df = df[df.id.isin(unique_ids.id) == False]

However, I get an empty DataFrame:

df.show()

+---+-----+
| id|value|
+---+-----+
+---+-----+

This "error" also shows up the other way around:

df[df.id.isin(unique_ids.id)]

returns all the rows of df.

Answer 1

Ama*_*nda 9

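The expression df.id.isin(unique_ids.id) == False evaluates Column<b'((id IN (id)) = false)'>, which is never true, because id is always in id. Likewise, the expression df.id.isin(unique_ids.id) evaluates Column<b'(id IN (id))'>, which is always true, so it returns the whole DataFrame. unique_ids.id is a column, not a list.

isin(*cols) takes a list of values as its argument, not a column of another DataFrame, so to make it work this way you should run the following: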

ids = unique_ids.rdd.map(lambda x:x.id).collect()
df[df.id.isin(ids)].collect() # or show...

You will get:

[Row(id=2, value=54), Row(id=4, value=13), Row(id=5, value=78)]

In any case, I think it would be better to join the two DataFrames:

df_ = df.join(unique_ids, on='id')

Which gives:

df_.show()
+---+-----+-----+
| id|value|count|
+---+-----+-----+
|  5|   78|    1|
|  2|   54|    1|
|  4|   13|    1|
+---+-----+-----+
