Using a null-safe condition in a DataFrame join

Question

Tib*_*rzz 5 python dataframe apache-spark apache-spark-sql pyspark

I have two DataFrames containing null values that I am trying to join using PySpark 2.3.0:

dfA:

# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+

dfB:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+

The DataFrames can be created with the following script:

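from pyspark.sql import SparkSession

# Outside the pyspark shell, `spark` has to be created explicitly.
spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame(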
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)

The join call:

dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()

Result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

Expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

If I set col2 in the first row to anything other than null, the join works, but I need to support null values.
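
For context, the row for a is lost because a join on column names compares keys with SQL equality, and under SQL semantics NULL = NULL evaluates to NULL rather than true, so the pair is never matched. The sketch below is a plain-Python model of the two comparison semantics, not Spark code; the function names (sql_equals, null_safe_equals, row_matches) are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence


def sql_equals(a: Optional[str], b: Optional[str]) -> Optional[bool]:
    """Model of SQL `=`: the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a: Optional[str], b: Optional[str]) -> bool:
    """Model of SQL `<=>` (eqNullSafe in PySpark): two nulls compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def row_matches(left_key: Sequence, right_key: Sequence, eq) -> bool:
    # A join keeps a pair of rows only when every key comparison is exactly True.
    return all(eq(l, r) is True for l, r in zip(left_key, right_key))


# The ('a', null) key from dfA compared against the ('a', null) key from dfB.
plain = row_matches(('a', None), ('a', None), sql_equals)
safe = row_matches(('a', None), ('a', None), null_safe_equals)
print(plain, safe)  # False True
```

With ordinary equality the null key never matches, which is why col3 comes back as null for that row; with null-safe equality it does match.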

I tried using a join condition with null-safe equality, as outlined in this post, like so:

cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

Result of the null-safe join:

# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+

This retains the duplicated columns, though; I am still looking for a way to achieve the expected result at the end of the join.

Answer 1

Sha*_*ica 5

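A simple solution is to select the columns you want to keep. This lets you specify which source DataFrame each column should come from, and it avoids the duplicate-column problem.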

dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
