pra*_*ash 6 apache-spark apache-spark-sql
Spark DataFrame 1:
+------+-------+---------+----+---+-------+
|city |product|date |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193 |
|city 1|prod 2 |8/25/2017|50 |687|201 |
|city 1|prod 3 |9/9/2017 |236 |431|169 |
|city 2|prod 1 |9/28/2017|358 |975|193 |
|city 2|prod 2 |8/24/2017|50 |687|201 |
|city 3|prod 3 |9/8/2017 |236 |431|169 |
+------+-------+---------+----+---+-------+
Spark DataFrame 2:
+------+-------+---------+----+---+-------+
|city |product|date |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193 |
|city 1|prod 2 |8/25/2017|50 |687|201 |
|city 1|prod 3 |9/9/2017 |230 |430|160 |
|city 1|prod 4 |9/27/2017|350 |90 |190 |
|city 2|prod 2 |8/24/2017|50 |687|201 |
|city 3|prod 3 |9/8/2017 |236 |431|169 |
|city 3|prod 4 |9/18/2017|230 |431|169 |
+------+-------+---------+----+---+-------+
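Please find the Spark DataFrames that satisfy the following conditions when applied to Spark DataFrame 1 and Spark DataFrame 2 given above.
Records with changes
The comparison keys here are "city", "product", and "date".
We need a solution that does not use Spark SQL.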
I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:
df2.except(df1)
This returns the rows of df2 that have no exact match in df1, that is, the rows that were added or modified in DataFrame 2. Output:
+------+-------+---------+----+---+-------+
| city|product| date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431| 169|
|city 1| prod 4|9/27/2017| 350| 90| 190|
|city 1| prod 3|9/9/2017 | 230|430| 160|
+------+-------+---------+----+---+-------+
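The same call with the operands swapped covers the other direction. The following is a minimal sketch, not part of the original answer, written as a script for spark-shell (`:paste`) or a Scala script runner. The local session settings and the names `addedOrModified` and `removedOrModified` are assumptions made for illustration. Note that `except` is a set difference over whole rows, so a modified record shows up in both results (new values in one, old values in the other).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; spark-shell reuses its own session.
val spark = SparkSession.builder().master("local[*]").appName("except-demo").getOrCreate()
import spark.implicits._

val columns = Seq("city", "product", "date", "sale", "exp", "wastage")

// Data copied from the two DataFrames in the question.
val df1 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
  ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
).toDF(columns: _*)

val df2 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
  ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
  ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
).toDF(columns: _*)

// Rows present only in df2: inserted rows plus the new version of modified rows.
val addedOrModified = df2.except(df1)

// Rows present only in df1: deleted rows plus the old version of modified rows.
val removedOrModified = df1.except(df2)

addedOrModified.show(false)
removedOrModified.show(false)
```

Because both results mix two kinds of rows, separating "deleted" from "modified" still requires a comparison on the key columns.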
You can also try join and filter to get the changed and unchanged data, as follows:
df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
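One way the join-and-filter idea could be carried through to all four categories is sketched below. It is an illustration rather than the original answer's code: it uses a single full outer join instead of the left and right joins above, and the helper `tagged`, the `left_`/`right_` prefixes, and the `in_left`/`in_right` marker columns are hypothetical names introduced here. It uses only the DataFrame API, with no SQL strings, and is written as a script for spark-shell (`:paste`) or a Scala script runner.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for experimentation; spark-shell reuses its own session.
val spark = SparkSession.builder().master("local[*]").appName("join-diff-demo").getOrCreate()
import spark.implicits._

val keys = Seq("city", "product", "date")
val values = Seq("sale", "exp", "wastage")

// Data copied from the two DataFrames in the question.
val df1 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
  ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
).toDF(keys ++ values: _*)

val df2 = Seq(
  ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
  ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
  ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
  ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
  ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
  ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
  ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
).toDF(keys ++ values: _*)

// Hypothetical helper: prefix the value columns and add a presence marker,
// so each side stays identifiable after the join.
def tagged(df: DataFrame, side: String): DataFrame =
  values
    .foldLeft(df)((d, c) => d.withColumnRenamed(c, s"${side}_$c"))
    .withColumn(s"in_$side", lit(true))

// Full outer join on the keys keeps rows that exist on only one side.
val joined = tagged(df1, "left").join(tagged(df2, "right"), keys, "full_outer")

// A null marker means the key was missing on that side.
val deleted = joined.filter(col("in_right").isNull)
val inserted = joined.filter(col("in_left").isNull)
val inBoth = joined.filter(col("in_left").isNotNull && col("in_right").isNotNull)

// Null-safe comparison (<=>) so null values do not hide a difference.
val isChanged = values.map(c => !(col(s"left_$c") <=> col(s"right_$c"))).reduce(_ || _)
val changed = inBoth.filter(isChanged)
val unchanged = inBoth.filter(!isChanged)

deleted.show(false)
inserted.show(false)
changed.show(false)
unchanged.show(false)
```

The marker columns are used instead of testing a value column for null, since a value column could legitimately contain nulls in real data.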
Hope this helps!
A scalable and easy way is to diff the two DataFrames using spark-extension:
import uk.co.gresearch.spark.diff._
df1.diff(df2, "city", "product", "date").show
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|diff| city|product| date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
| N|city 1|prod 2 |2017-08-25| 50| 50| 687| 687| 201| 201|
| C|city 1|prod 3 |2017-09-09| 236| 230| 431| 430| 169| 160|
| I|city 3|prod 4 |2017-09-18| null| 230| null| 431| null| 169|
| N|city 3|prod 3 |2017-09-08| 236| 236| 431| 431| 169| 169|
| D|city 2|prod 1 |2017-09-28| 358| null| 975| null| 193| null|
| I|city 1|prod 4 |2017-09-27| null| 350| null| 90| null| 190|
| N|city 1|prod 1 |2017-09-29| 358| 358| 975| 975| 193| 193|
| N|city 2|prod 2 |2017-08-24| 50| 50| 687| 687| 201| 201|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
It identifies Inserted, Changed, Deleted and uN-changed rows.
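To pull out a single category, the result can be filtered on the `diff` column. The sketch below assumes the spark-extension dependency is on the classpath and that the default column name `diff` and the default markers `I`, `C`, `D`, `N` shown in the output above are in effect; `splitByDiff` is a hypothetical helper name, not part of the library.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import uk.co.gresearch.spark.diff._

// Hypothetical helper: run the diff once and expose one DataFrame per category.
// Assumes the library's default diff column name and marker values.
def splitByDiff(left: DataFrame, right: DataFrame, keys: String*): Map[String, DataFrame] = {
  val result = left.diff(right, keys: _*)
  Map(
    "inserted"  -> result.where(col("diff") === "I"),
    "changed"   -> result.where(col("diff") === "C"),
    "deleted"   -> result.where(col("diff") === "D"),
    "unchanged" -> result.where(col("diff") === "N")
  )
}
```

For the DataFrames in the question this could be called as `splitByDiff(df1, df2, "city", "product", "date")("changed").show()`.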