I have two data frames in PySpark. Their schemas are as follows:
df1
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]
df2
DataFrame[serial_number: string, model_name: string, mac_address: string]
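For experimenting with the snippets below, here is a minimal sketch of how two such frames could be built; the local SparkSession setup and the sample rows are made up for illustration and are not from the original post:

from pyspark.sql import SparkSession, functions as f

# Hypothetical local session and toy rows, only to make the examples runnable.
spark = SparkSession.builder.master('local[*]').getOrCreate()

df1 = spark.createDataFrame(
    [(1, 'a@example.com', 'Austin', 'TX', '73301', 'SN-1')],
    'customer_id int, email string, city string, state string, '
    'postal_code string, serial_number string')

df2 = spark.createDataFrame(
    [('SN-1', 'model-x', '00:11:22:33:44:55'),
     ('SN-2', 'model-y', '66:77:88:99:aa:bb')],
    'serial_number string, model_name string, mac_address string')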
Now I want to do a full outer join of these two data frames and use coalesce on the join column.
I have done it as follows, and it gives the expected result:
full_df = (df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code,
            f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'),
            df2.model_name, df2.mac_address))
Now I want to do the same thing in a different way. Instead of spelling out every column name in the select after the join, I would like to use something like * on the data frame. Basically I want something like this:
full_df = (df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select('df1.*',
            f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'),
            df2.model_name, df2.mac_address)
    .drop('serial_number'))
This gives me exactly what I want. Is there a better way to do this kind of operation in PySpark?
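One caveat on the 'df1.*' variant above: the string 'df1.*' only resolves when the DataFrame actually carries that alias, so something like the following sketch is typically needed (the .alias() calls are my addition and not part of the original code):

# Sketch: alias both frames so that the string 'df1.*' can be resolved.
d1 = df1.alias('df1')
d2 = df2.alias('df2')

full_df = (d1
    .join(d2, d1.serial_number == d2.serial_number, 'full_outer')
    .select('df1.*',
            f.coalesce(d1.serial_number, d2.serial_number).alias('serial_number1'),
            d2.model_name, d2.mac_address)
    .drop('serial_number'))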
Edit
This is not a duplicate of /sf/ask/2529262571/?rq=1, because I am using coalesce in the join statement. I want to know whether there is a way to exclude the column on which I am applying the coalesce function.
Answer from 小智 (score 5)
You can do it like this:
from pyspark.sql import functions as f

# Keep every df1 column except the join key, then append the coalesced key.
(df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(
        [df1[c] for c in df1.columns if c != 'serial_number'] +
        [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')]
    ))
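Since the column list is built from df1.columns, nothing except the join key is hard-coded, and the df2-only columns can be appended in the same way. A hypothetical helper that packages the pattern (join_and_coalesce is my own name and not from the thread):

# Hypothetical helper, not part of the original answer: full outer join two
# frames on a shared key, keep all other columns, and coalesce the key itself.
def join_and_coalesce(left, right, key, how='full_outer'):
    return (left
        .join(right, left[key] == right[key], how)
        .select([left[c] for c in left.columns if c != key] +
                [f.coalesce(left[key], right[key]).alias(key)] +
                [right[c] for c in right.columns if c != key]))

full_df = join_and_coalesce(df1, df2, 'serial_number')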