Ath*_*kur 3 scala apache-spark apache-spark-sql
I have two DataFrames, shown below:
+--------------------+--------+-----------+-------------+
|UniqueFundamentalSet|Taxonomy|FFAction|!||DataPartition|
+--------------------+--------+-----------+-------------+
|192730241374        |1       |I|!|       |Japan        |
|192730241374        |2       |I|!|       |Japan        |
|192730241373        |1       |I|!|       |Japan        |
|192730241373        |2       |I|!|       |Japan        |
+--------------------+--------+-----------+-------------+

+--------------------+--------+-----------+-------------+
|UniqueFundamentalSet|Taxonomy|FFAction|!||DataPartition|
+--------------------+--------+-----------+-------------+
|192730241374        |1       |I|!|       |Japan        |
|192730241374        |2       |I|!|       |Japan        |
|192730391384        |1       |I|!|       |Japan        |
|192730391384        |2       |I|!|       |Japan        |
|192730241373        |1       |I|!|       |Japan        |
|192730241373        |2       |I|!|       |Japan        |
+--------------------+--------+-----------+-------------+
When I perform a union of the two DataFrames above, I get duplicate rows. This is my output:
+--------------------+--------+-----------+-------------+
|UniqueFundamentalSet|Taxonomy|FFAction|!||DataPartition|
+--------------------+--------+-----------+-------------+
|192730241374        |1       |I|!|       |Japan        |
|192730241374        |2       |I|!|       |Japan        |
|192730241373        |1       |I|!|       |Japan        |
|192730241373        |2       |I|!|       |Japan        |
|192730241374        |1       |I|!|       |Japan        |
|192730241374        |2       |I|!|       |Japan        |
|192730391384        |1       |I|!|       |Japan        |
|192730391384        |2       |I|!|       |Japan        |
|192730241373        |1       |I|!|       |Japan        |
|192730241373        |2       |I|!|       |Japan        |
+--------------------+--------+-----------+-------------+
val dfToSave = dfMainOutput.union(insertdf)
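I was under the impression that union removes duplicate rows while unionAll keeps them, yet I have to call distinct after union to get rid of the duplicates. Can someone explain this?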
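For reference, the behaviour described above matches how the Spark Dataset API documents `union`: it is equivalent to SQL `UNION ALL`, so it keeps duplicates, and a SQL-style set union needs `distinct()` afterwards. Below is a minimal sketch that reproduces the situation. The column types (String, Int, String, String), the local `SparkSession` setup, and the object name `UnionDistinctSketch` are assumptions made for illustration; only `dfMainOutput`, `insertdf`, and `dfToSave` come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; not part of the original question.
object UnionDistinctSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to make the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("union-distinct-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed column types; the question does not show the schema.
    val columns = Seq("UniqueFundamentalSet", "Taxonomy", "FFAction|!|", "DataPartition")

    val dfMainOutput = Seq(
      ("192730241374", 1, "I|!|", "Japan"),
      ("192730241374", 2, "I|!|", "Japan"),
      ("192730241373", 1, "I|!|", "Japan"),
      ("192730241373", 2, "I|!|", "Japan")
    ).toDF(columns: _*)

    val insertdf = Seq(
      ("192730241374", 1, "I|!|", "Japan"),
      ("192730241374", 2, "I|!|", "Japan"),
      ("192730391384", 1, "I|!|", "Japan"),
      ("192730391384", 2, "I|!|", "Japan"),
      ("192730241373", 1, "I|!|", "Japan"),
      ("192730241373", 2, "I|!|", "Japan")
    ).toDF(columns: _*)

    // union behaves like SQL UNION ALL: all 4 + 6 = 10 rows are kept.
    val unioned = dfMainOutput.union(insertdf)

    // distinct() removes fully identical rows, leaving the 6 unique ones.
    val dfToSave = unioned.distinct()

    println(s"rows after union: ${unioned.count()}")
    println(s"rows after union + distinct: ${dfToSave.count()}")

    spark.stop()
  }
}
```

Note that `distinct()` compares entire rows; if duplicates should be decided on a subset of columns only, `dropDuplicates` with a list of column names is the usual alternative.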