Que*_*ank 9 apache-spark pyspark
I have created two data frames in PySpark. Both data frames have a column named id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
|     name| id|
+---------+---+
|   Pirate|  1|
|   Monkey|  2|
|    Ninja|  3|
|Spaghetti|  4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave|  1|
| Thor|  2|
| face|  3|
| test|  5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
|     name|  id|Movie|  id|
+---------+----+-----+----+
|   Pirate|   1| dave|   1|
|   Monkey|   2| Thor|   2|
|    Ninja|   3| face|   3|
|Spaghetti|   4| null|null|
|     null|null| test|   5|
+---------+----+-----+----+
I want full_outer_join to produce a result like the following:
+---------+-----+----+
|     name|Movie|  id|
+---------+-----+----+
|   Pirate| dave|   1|
|   Monkey| Thor|   2|
|    Ninja| face|   3|
|Spaghetti| null|   4|
|     null| test|   5|
+---------+-----+----+
I tried the following, but got a different result:
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
|     name|  id|Movie|
+---------+----+-----+
|   Pirate|   1| dave|
|   Monkey|   2| Thor|
|    Ninja|   3| face|
|Spaghetti|   4| null|
|     null|null| test|
+---------+----+-----+
As you can see, id 5 is missing from my result data frame.
How can I achieve my goal?
Psi*_*dom 12
Since the join columns have the same name, you can specify the join column as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
    F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
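To see why selecting a.id alone loses id 5, and why coalescing recovers it, here is a minimal pure-Python sketch of the full outer join. It is only a model of the behaviour, not how Spark implements joins: None stands in for null, the local coalesce helper is a stand-in for F.coalesce, and it assumes each id appears at most once per side, as in the question's data.

```python
# Pure-Python model of a full outer join on id; None stands in for Spark's null.
a_rows = [("Pirate", 1), ("Monkey", 2), ("Ninja", 3), ("Spaghetti", 4)]
b_rows = [("dave", 1), ("Thor", 2), ("face", 3), ("test", 5)]

# Assumption: ids are unique within each side, as in the question's data.
a_by_id = {id_: name for name, id_ in a_rows}
b_by_id = {id_: movie for movie, id_ in b_rows}


def coalesce(*values):
    """Return the first value that is not None (stand-in for F.coalesce)."""
    return next((v for v in values if v is not None), None)


joined = []
for key in sorted(a_by_id.keys() | b_by_id.keys()):
    a_id = key if key in a_by_id else None  # a.id is null when only b has the key
    b_id = key if key in b_by_id else None  # b.id is null when only a has the key
    joined.append((a_by_id.get(key), a_id, b_by_id.get(key), b_id))

# Selecting only a.id reproduces the missing 5 from the question.
only_a_id = [(name, a_id, movie) for name, a_id, movie, _ in joined]

# Coalescing both id columns keeps every key.
merged_id = [(name, movie, coalesce(a_id, b_id)) for name, a_id, movie, b_id in joined]

for row in merged_id:
    print(row)
```

The row for 'test' has no match in a, so its a.id is null; taking the first non-null of a.id and b.id gives a single id column with no gaps.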
You can rename the id column in data frame b and drop it later, or you can pass a list as the join condition.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name     |Movie|
+---+---------+-----+
|1  |Pirate   |dave |
|3  |Ninja    |face |
|5  |null     |test |
|4  |Spaghetti|null |
|2  |Monkey   |Thor |
+---+---------+-----+