How to join/merge a list of dataframes on a common key in PySpark?

Geo*_*eRF 5 python apache-spark apache-spark-sql pyspark

df1
     uid1  var1
0    John     3
1    Paul     4
2  George     5
df2
     uid1  var2
0    John    23
1    Paul    44
2  George    52
df3
     uid1  var3
0    John    31
1    Paul    45
2  George    53
df_lst = [df1, df2, df3]

How can I merge/join the 3 dataframes in the list on the common key uid1?

Edit: expected output

   df1
     uid1  var1  var2  var3
0    John     3    23    31
1    Paul     4    44    45
2  George     5    52    53

Sha*_*ala 6

You can join a list of dataframes by folding over it with `reduce`. Below is a simple example (in Scala):

import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  (0, "John", 3),
  (1, "Paul", 4),
  (2, "George", 5)
)).toDF("id", "uid1", "var1")

val df2 = spark.sparkContext.parallelize(Seq(
  (0, "John", 23),
  (1, "Paul", 44),
  (2, "George", 52)
)).toDF("id", "uid1", "var2")

val df3 = spark.sparkContext.parallelize(Seq(
  (0, "John", 31),
  (1, "Paul", 45),
  (2, "George", 53)
)).toDF("id", "uid1", "var3")

val dfs = List(df1, df2, df3)

dfs.reduce((a, b) => a.join(b, Seq("id", "uid1")))

Output:

+---+------+----+----+----+
| id|  uid1|var1|var2|var3|
+---+------+----+----+----+
|  1|  Paul|   4|  44|  45|
|  2|George|   5|  52|  53|
|  0|  John|   3|  23|  31|
+---+------+----+----+----+

Hope this helps!

  • No, I can't figure out reduce in python (2 upvotes)
  • This answer is in the right direction, but it doesn't answer the question in pyspark. For example, `List` and `=>` need to be translated into pyspark (2 upvotes)
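As the comments note, the Scala `reduce` translates directly to Python's `functools.reduce`. A minimal sketch: the commented PySpark call assumes an active SparkSession named `spark` and the dataframes from the question, while the runnable part below demonstrates the same reduce-a-list-of-joins pattern with plain dicts so it works without Spark:

```python
from functools import reduce

# PySpark equivalent of the Scala answer (assumes an active SparkSession
# `spark` and dataframes df1, df2, df3 with the columns shown above):
#
#     df = reduce(lambda a, b: a.join(b, ["id", "uid1"]), [df1, df2, df3])
#
# The reduce pattern itself, shown with plain dicts keyed by uid1:
tables = [
    {"John": {"var1": 3}, "Paul": {"var1": 4}, "George": {"var1": 5}},
    {"John": {"var2": 23}, "Paul": {"var2": 44}, "George": {"var2": 52}},
    {"John": {"var3": 31}, "Paul": {"var3": 45}, "George": {"var3": 53}},
]

def inner_join(a, b):
    # keep only keys present on both sides, merging their columns
    return {k: {**a[k], **b[k]} for k in a if k in b}

# reduce applies inner_join pairwise: ((t1 ⋈ t2) ⋈ t3)
merged = reduce(inner_join, tables)
print(merged["Paul"])  # → {'var1': 4, 'var2': 44, 'var3': 45}
```

`reduce(f, [df1, df2, df3])` computes `f(f(df1, df2), df3)`, which is exactly what the Scala `dfs.reduce(...)` does; in PySpark the join columns are passed as a plain Python list instead of `Seq`.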