How to join two data frames in Apache Spark and merge the keys into one column?

che*_*ens 2 join dataframe apache-spark apache-spark-sql pyspark

I have the following two Spark data frames:

sale_df:

+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+

and target_df:

+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+

How can I join them so that the output is:

user_id   total_sale   personalized_target
 a           1100            NA
 b           2100            1000
 c           3300            2000
 d           4400            3000
 e           NA              4000

I have tried all the join types, but it seems that no single join can produce the desired output.

Any help using PySpark, or SQL with HiveContext, would be appreciated.

Wil*_*ton 9

You can use the equi-join syntax in Scala:

  val output = sales_df.join(target_df, Seq("user_id"), joinType = "outer")

You should check whether it works in Python:

   output = sales_df.join(target_df, ['user_id'], "outer")
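Note that the outer join leaves the missing values as null, not the literal NA shown in the question. If you really need the string "NA", one option (a sketch, reusing the data frame and column names from the question) is to cast the numeric columns to string and coalesce with a literal:

   from pyspark.sql import functions as F

   output = sales_df.join(target_df, ['user_id'], "outer")

   # fillna("NA") only touches string columns, so cast the numeric
   # columns to string first, then replace nulls with the literal "NA"
   output_na = output.select(
       'user_id',
       F.coalesce(F.col('total_sale').cast('string'), F.lit('NA')).alias('total_sale'),
       F.coalesce(F.col('personalized_target').cast('string'), F.lit('NA')).alias('personalized_target'),
   )
   output_na.show()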


eli*_*sah 5

You need to perform an outer equi-join:

# Recreate the two data frames from the question
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1, ['user_id', 'total_sale'])
data2 = [['b', 1000], ['c', 2000], ['d', 3000], ['e', 4000]]
target = sqlContext.createDataFrame(data2, ['user_id', 'personalized_target'])

# An outer join on user_id keeps rows from both sides, merging the join
# key into a single column and filling the missing side with null
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+
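Since the question also mentions SQL and HiveContext, the same outer join can be written in Spark SQL (a sketch; registerTempTable is the pre-2.0 API matching the sqlContext used above, and the table names are arbitrary):

sales.registerTempTable("sales")
target.registerTempTable("target")

# COALESCE merges the two user_id columns into one, as the title asks
sqlContext.sql("""
    SELECT COALESCE(s.user_id, t.user_id) AS user_id,
           s.total_sale,
           t.personalized_target
    FROM sales s
    FULL OUTER JOIN target t
      ON s.user_id = t.user_id
""").show()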