A shortened example:
vals1 = [(1, "a"),
         (2, "b"),
        ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
        ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()
The result has one column named id and two columns named name. Assuming the real dataframes have dozens of such columns, how can the columns with duplicate names be renamed?
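To make the problem concrete, this is roughly what the joined frame looks like (row order and null rendering may differ by Spark version), and why a bare column reference fails:

df1.show()
# +---+----+----+
# | id|name|name|
# +---+----+----+
# |  1|   a|   k|
# |  2|   b|null|
# +---+----+----+
# df1.select("name") raises an AnalysisException ("reference 'name' is ambiguous"),
# while df1.select("df1.name") still works because the join kept the aliases.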
Another approach that renames only the intersecting columns:
from typing import List
from pyspark.sql import DataFrame

def join_intersect(df_left: DataFrame, df_right: DataFrame, join_cols: List[str], how: str = 'inner') -> DataFrame:
    # Columns that appear in both frames
    intersected_cols = set(df_left.columns).intersection(set(df_right.columns))
    # Rename every shared column except the join keys so the result has no duplicates
    cols_to_rename = [c for c in intersected_cols if c not in join_cols]
    for c in cols_to_rename:
        df_left = df_left.withColumnRenamed(c, f"{c}__1")
        df_right = df_right.withColumnRenamed(c, f"{c}__2")
    return df_left.join(df_right, on=join_cols, how=how)
vals1 = [(1, "a"), (2, "b")]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k")]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
df_joined = join_intersect(df1, df2, ['name'])
df_joined.show()
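Applied to the original question, where the join key is id rather than name, a call along these lines (a sketch; row order may differ) keeps a single id column and renames the clashing name columns:

df_joined = join_intersect(df1, df2, ['id'], how='full')
df_joined.show()
# +---+-------+-------+
# | id|name__1|name__2|
# +---+-------+-------+
# |  1|      a|      k|
# |  2|      b|   null|
# +---+-------+-------+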