Spark Scala: filter a DataFrame where a value is not in another DataFrame

Question

I have two DataFrames, a and b. This is what they look like:

a
-------
v1 string
v2 string

roughly hundreds of millions of rows


b
-------
v2 string

roughly tens of millions of rows

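I want to keep the rows of DataFrame a whose v2 does not appear in b("v2").

I know I could use a left join and filter where the right side is null, or use Spark SQL, which has a "NOT IN" construct. I bet there is a better way, though.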
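For reference, the two approaches mentioned above might look roughly like the sketch below. The object name AntiJoinSketch, the helper names viaLeftJoin and viaNotIn, the temporary column b_v2, and the view names a and b are all made up for illustration; only the column names v1 and v2 come from the schemas above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object AntiJoinSketch {
  // Left outer join, then keep only the rows of `a` that found no match in `b`.
  def viaLeftJoin(a: DataFrame, b: DataFrame): DataFrame = {
    // Rename b's column (hypothetical name b_v2) so the null check is unambiguous.
    val bKeys = b.select(col("v2").as("b_v2"))
    a.join(bKeys, a("v2") === bKeys("b_v2"), "left_outer")
      .where(col("b_v2").isNull)
      .drop("b_v2")
  }

  // The same filter expressed with a Spark SQL NOT IN subquery.
  def viaNotIn(a: DataFrame, b: DataFrame): DataFrame = {
    // View names "a" and "b" are assumptions chosen to mirror the question.
    a.createOrReplaceTempView("a")
    b.createOrReplaceTempView("b")
    a.sparkSession.sql("SELECT * FROM a WHERE v2 NOT IN (SELECT v2 FROM b)")
  }
}
```

The two forms differ on nulls: NOT IN returns no rows at all if b contains a null v2, whereas the join-based form is unaffected by nulls in b. Spark 2.0 and later also accept "left_anti" as a join type, which expresses the join-based filter in a single step.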

Answer 1

Use PairRDDFunctions.subtractByKey:

def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]

Return an RDD with the pairs from this whose keys are not in other.

(There are variants that offer control over partitioning. See the docs.)

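So you would do a.rdd.map(r => (r.getString(1), r.getString(0))).subtractByKey(b.rdd.map(r => (r.getString(0), ()))).toDF("v2", "v1"). Note that a.rdd and b.rdd are both RDD[Row], so each side has to be mapped to key-value pairs keyed by v2 before subtractByKey applies, and toDF needs import spark.implicits._ in scope.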
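Spelled out as a complete program, the idea could look like the following minimal sketch. It assumes the schemas from the question (a with string columns v1 and v2, b with a string column v2); the object name SubtractByKeySketch, the helper rowsNotIn, and the sample data in main are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SubtractByKeySketch {
  // Keep the rows of `a` whose v2 does not occur in b("v2").
  // Assumes a has string columns (v1, v2) and b has a string column v2.
  def rowsNotIn(a: DataFrame, b: DataFrame): DataFrame = {
    val spark = a.sparkSession
    import spark.implicits._ // enables .toDF on RDDs of tuples

    // Key a by v2 so that subtractByKey compares on that column.
    val aByV2: RDD[(String, String)] =
      a.select("v2", "v1").rdd.map(r => (r.getString(0), r.getString(1)))

    // subtractByKey needs a pair RDD on the right; its values are ignored.
    val bByV2: RDD[(String, Unit)] =
      b.select("v2").rdd.map(r => (r.getString(0), ()))

    // Subtract, then restore the original column order (v1, v2).
    aByV2.subtractByKey(bByV2).toDF("v2", "v1").select("v1", "v2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for trying the sketch out
      .appName("subtractByKey-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up sample data standing in for the real a and b.
    val a = Seq(("x", "1"), ("y", "2"), ("z", "3")).toDF("v1", "v2")
    val b = Seq("2").toDF("v2")

    rowsNotIn(a, b).show()
    spark.stop()
  }
}
```

To control partitioning, subtractByKey also has overloads that take a partition count or a Partitioner as a second argument, for example aByV2.subtractByKey(bByV2, 200), where 200 is an arbitrary illustrative value.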