join dataframe (apache-spark, apache-spark-sql, pyspark)
I have the following two Spark DataFrames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+
How can I join them so that the output looks like this:
user_id  total_sale  personalized_target
a        1100        NA
b        2100        1000
c        3300        2000
d        4400        4000
e        NA          4000
I have tried all the join types, but it seems that no single join produces the desired output.
Any help with PySpark, or with SQL and HiveContext, would be appreciated.
In Scala you can use the equi-join syntax:
val output = sales_df.join(target_df, Seq("user_id"), joinType = "outer")
You should check whether it works the same way in Python:
output = sales_df.join(target_df, ['user_id'], "outer")
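The outer join leaves null where a user is missing from one side. If you literally want the string NA shown in the desired output, you can coalesce after casting the numeric columns to string; a minimal sketch, assuming the output DataFrame from the join above:
from pyspark.sql import functions as F

# Replace nulls with the literal string "NA" (requires casting the
# numeric columns to string so the columns have a single type)
output_na = output.select(
    "user_id",
    F.coalesce(F.col("total_sale").cast("string"), F.lit("NA")).alias("total_sale"),
    F.coalesce(F.col("personalized_target").cast("string"), F.lit("NA")).alias("personalized_target"),
)
output_na.show()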
You need to perform an outer equi-join:
# Build the two example DataFrames
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1, ['user_id', 'total_sale'])

data2 = [['b', 1000], ['c', 2000], ['d', 3000], ['e', 4000]]
target = sqlContext.createDataFrame(data2, ['user_id', 'personalized_target'])

# A full outer join on user_id keeps users present in either DataFrame
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# | e| null| 4000|
# | d| 4400| 3000|
# | c| 3300| 2000|
# | b| 2100| 1000|
# | a| 1100| null|
# +-------+----------+-------------------+
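Since the question also allows a SQL/HiveContext solution, the same full outer join can be written in Spark SQL after registering the DataFrames as temporary tables; a sketch using the Spark 1.x registerTempTable API to match the sqlContext above:
# Register the DataFrames so they are visible to Spark SQL
sales.registerTempTable("sales")
target.registerTempTable("target")

# COALESCE picks the non-null user_id from either side of the join
sqlContext.sql("""
    SELECT COALESCE(s.user_id, t.user_id) AS user_id,
           s.total_sale,
           t.personalized_target
    FROM sales s
    FULL OUTER JOIN target t
    ON s.user_id = t.user_id
""").show()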