PySpark - get the index of duplicate rows

Question

Suppose I have a PySpark DataFrame like this:

+--+--+--+--+
|a |b |c |d |
+--+--+--+--+
|1 |0 |1 |2 |
|0 |2 |0 |1 |
|1 |0 |1 |2 |
|0 |4 |3 |1 |
+--+--+--+--+

How can I add a column that flags every duplicated row, like this:

+--+--+--+--+--+
|a |b |c |d |e |
+--+--+--+--+--+
|1 |0 |1 |2 |1 |
|0 |2 |0 |1 |0 |
|1 |0 |1 |2 |1 |
|0 |4 |3 |1 |0 |
+--+--+--+--+--+

I tried using groupBy with aggregate functions, to no avail.

Answer 1

pau*_*ult 8

Just to expand on my comment:

You can group by all of the columns and use pyspark.sql.functions.count() to determine whether a row is duplicated:

import pyspark.sql.functions as f
df.groupBy(df.columns).agg((f.count("*")>1).cast("int").alias("e")).show()
#+---+---+---+---+---+
#|  a|  b|  c|  d|  e|
#+---+---+---+---+---+
#|  1|  0|  1|  2|  1|
#|  0|  2|  0|  1|  0|
#|  0|  4|  3|  1|  0|
#+---+---+---+---+---+

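Here we use count("*") > 1 as the aggregate function and cast the result to an int. The groupBy() has the side effect of dropping the duplicate rows from the result. Depending on your needs, this may be sufficient.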
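To see what this aggregation computes without starting a Spark session, here is a plain-Python analogy. It is only a sketch of the logic, not Spark code: the names `rows`, `counts`, `grouped`, and `flagged` are made up for illustration, and the data is the sample table from the question.

```python
from collections import Counter

# Sample rows taken from the question's table (columns a, b, c, d).
rows = [(1, 0, 1, 2), (0, 2, 0, 1), (1, 0, 1, 2), (0, 4, 3, 1)]

# Analogue of groupBy(all columns) + count("*"): one entry per distinct row.
counts = Counter(rows)

# Analogue of (count > 1).cast("int"): one flag per distinct row.
# Like the groupBy() result, this holds each duplicated row only once.
grouped = {row: int(n > 1) for row, n in counts.items()}

# Looking the flag up again for every original row keeps all rows,
# duplicates included.
flagged = [row + (grouped[row],) for row in rows]

print(grouped)
print(flagged)
```

The `grouped` mapping has three entries for four input rows, which is the same row loss that groupBy() causes; `flagged` keeps all four.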

However, if you want to keep all of the rows, you can use a Window function as shown in the other answers, or you can use a join():

df.join(
    df.groupBy(df.columns).agg((f.count("*")>1).cast("int").alias("e")),
    on=df.columns,
    how="inner"
).show()
#+---+---+---+---+---+
#|  a|  b|  c|  d|  e|
#+---+---+---+---+---+
#|  1|  0|  1|  2|  1|
#|  1|  0|  1|  2|  1|
#|  0|  2|  0|  1|  0|
#|  0|  4|  3|  1|  0|
#+---+---+---+---+---+

Here we inner join the original DataFrame with the DataFrame produced by the groupBy() above, joining on all of the columns.

Answer 2

Ram*_*jan 7

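Define a window function that checks whether the count of rows, when partitioned by all columns, is greater than 1. If it is, the row is a duplicate (1); otherwise it is not (0).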

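from pyspark.sql import functions as f
from pyspark.sql import window as w

allColumns = df.columns
# Partition by every column so identical rows share one window.
# Window.unboundedPreceding/unboundedFollowing replace -sys.maxint/sys.maxint,
# which only exist in Python 2.
windowSpec = w.Window.partitionBy(allColumns).rowsBetween(
    w.Window.unboundedPreceding, w.Window.unboundedFollowing
)

# Note: count(col('d')) skips NULLs in d; count('*') counts every row.
df.withColumn('e', f.when(f.count(f.col('d')).over(windowSpec) > 1, f.lit(1)).otherwise(f.lit(0))).show(truncate=False)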

This should give you:

+---+---+---+---+---+
|a  |b  |c  |d  |e  |
+---+---+---+---+---+
|1  |0  |1  |2  |1  |
|1  |0  |1  |2  |1  |
|0  |2  |0  |1  |0  |
|0  |4  |3  |1  |0  |
+---+---+---+---+---+

I hope the answer is helpful.

Update

As @pault commented, you can eliminate when, col, and lit by casting the boolean to an integer:

df.withColumn('e', (f.count('*').over(windowSpec) > 1).cast('int')).show(truncate=False)

There is no need for `when`, `col`, or `lit` here - you can cast the condition to an integer: `df.withColumn('e', (f.count('*').over(windowSpec) > 1).cast('int')).show(truncate=False)` (2 upvotes)
