How do I use a join with multiple conditions in PySpark?

Question

Viv*_*Viv 2 python apache-spark apache-spark-sql

I can use a DataFrame join statement with a single condition in PySpark, but as soon as I try to add multiple conditions it fails.

Code:

   summary2 = summary.join(county_prop, ["category_id", "bucket"], how = "leftouter")

The code above works. However, if I add another condition to the list, such as summary.bucket == 9, it fails. Please help me fix this.

   The error for the statement 
   summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")

   ERROR : TypeError: 'Column' object is not callable

Edit:

Adding a complete working example.

   from pyspark.sql.types import IntegerType, StringType, StructField, StructType

   # bucket and the count columns hold integers, so they are declared as IntegerType
   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()), StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
   bucket_summary = sqlContext.createDataFrame([], schema)

   temp_county_prop = sqlContext.createDataFrame([("nation", "nation", 1, 222, 444, 555), ("nation", "state", 2, 222, 444, 555)], schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation", "state", 2, 121, 221, 551)], schema)

What I want to do:

Join on the category_id and bucket columns, and replace the values in bucket_summary with those from county_prop.

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
   bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")

1. It works if I write out each condition as a full column expression, and it also works if I list only column names, e.g. ["category_id", "bucket"].

2. But it does not work if I combine the two, e.g. cond = ["bucket", bucket_summary.category_id == "state"].

What could be wrong with the second statement?

Answer 1

Zha*_*ong 5

For example:

df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

In your case, though, (summary.bucket) == 9 should not appear as a join condition.
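One way to read this advice: a comparison against a constant such as `summary.bucket == 9` is a row filter rather than a key match, so it can be applied with `filter` before the join, leaving only column names in the list passed to `join`. The sketch below assumes `summary` and `county_prop` are the DataFrames from the question; the helper name `join_bucket_9` is hypothetical. Filtering first drops every other bucket from the result, which is assumed here to be the intent.

```python
def join_bucket_9(summary, county_prop):
    """Keep only bucket 9 from `summary`, then left-join on the shared keys.

    Hypothetical helper: `summary` and `county_prop` are assumed to be
    PySpark DataFrames that both have `category_id` and `bucket` columns.
    """
    # The constant comparison is a row filter, so apply it with filter().
    filtered = summary.filter(summary["bucket"] == 9)
    # The join list now holds column names only, which join() accepts.
    return filtered.join(county_prop, ["category_id", "bucket"], "left_outer")
```

It would be called as `summary2 = join_bucket_9(summary, county_prop)`.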

Update:

In the join condition you can pass either a list of Column join expressions or a list of column names.
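Put differently, the list should hold either names only or expressions only; a mixed list such as `["bucket", bucket_summary.category_id == "state"]` from the question is the case that fails. A minimal sketch of one workaround, assuming both DataFrames contain every column given by name, is to turn each name into an explicit equality so that the list holds expressions only. The helper `as_column_conditions` is hypothetical and not part of PySpark.

```python
def as_column_conditions(left, right, conditions):
    """Return `conditions` with every column name turned into an equality.

    Hypothetical helper, not a PySpark API. `left` and `right` are assumed
    to be DataFrames that both contain each column given by name.
    """
    normalized = []
    for cond in conditions:
        if isinstance(cond, str):
            # A bare name means "equal on both sides": build that expression.
            normalized.append(left[cond] == right[cond])
        else:
            # Already a Column expression: keep it unchanged.
            normalized.append(cond)
    return normalized
```

With the question's data this could be used as `cond = as_column_conditions(bucket_summary, county_prop, ["bucket", bucket_summary.category_id == "state"])` followed by `bucket_summary.join(county_prop, cond, how="leftouter")`. Unlike a join on names, a join on expressions keeps the `bucket` column from both sides, so one of them may need to be dropped or renamed afterwards.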
