Spark: remove duplicate rows from a DataFrame

voi*_*oid 4 scala dataframe apache-spark apache-spark-sql

Suppose I have a DataFrame such as:

val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to drop duplicate rows of column "a" based on the value of column "b". That is, if column "a" has duplicate values, I want to keep the row with the larger "b" value. For the example above, after processing I need only:

{"a":3,"b":9,"c":22,"d":12}

{"a":1,"b":4,"c":23,"d":12}

Spark's DataFrame dropDuplicates API does not seem to support this. Using the RDD approach, I could do a map().reduceByKey(), but is there a DataFrame-specific operation that does this?
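For reference, here is roughly what I mean by the RDD workaround (a sketch only; it assumes the numeric columns come back as longs, which is what sqlContext.read.json infers for these values):

import org.apache.spark.sql.Row

// Key each Row by column "a", then keep the row with the larger "b" per key.
val deduped = df.rdd
  .keyBy(row => row.getLong(row.fieldIndex("a")))
  .reduceByKey((r1, r2) =>
    if (r1.getLong(r1.fieldIndex("b")) >= r2.getLong(r2.fieldIndex("b"))) r1 else r2)
  .values
val result = sqlContext.createDataFrame(deduped, df.schema)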

Any help is appreciated, thanks.

Pan*_*ora 9

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect

This will achieve what you need. You can learn more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
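If you prefer the DataFrame API over raw SQL, the same idea can be expressed with the Window expression support added in Spark 1.4 (a sketch; note that row_number() was named rowNumber() before Spark 1.6):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each "a" partition by descending "b", keep the top row.
val w = Window.partitionBy("a").orderBy(df("b").desc)
val result = df.withColumn("rn", row_number().over(w))
  .filter("rn = 1")
  .drop("rn")

Both forms express the same ROW_NUMBER() ranking, so you can use whichever style fits the rest of your code.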