PySpark .join() with different column names that cannot be hard-coded before run time

Question

I found final = ta.join(tb, on=['ID'], how='left'), where both the left and right sides have a column with the same name, 'ID'.

I also found final = ta.join(tb, ta.leftColName == tb.rightColName, how='left'), where the left and right column names are known before run time, so they can be hard-coded.

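But what if the left and right column names of the on predicate are different and are computed or derived from configuration variables? For example: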

1) leftColName = 'leftKey'

2) rightColName = 'rightKey'

3) final = ta.join(tb, ta.leftColname == tb.rightColname, how='left')

The values of leftColName and rightColName are not known until run time, so line 3 cannot be hard-coded in advance.

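The following does not work, because I find the runtime can intermittently lose track of whether leftColName and rightColName refer to ta or to tb: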

final = ta.join(tb, f.col(leftColName) == f.col(rightColName), 'left')

Scala seems to have a facility for this.

Answer 1

You are referencing the column as ta.leftColname, but, as in Pandas, you can also reference it as ta["leftColname"].

This way you can use variables instead of hard-coded column names. For example:

left_key = 'leftColname'
right_key = 'rightColname'
final = ta.join(tb, ta[left_key] == tb[right_key], how='left')
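
For a fuller picture, the snippet above can be wrapped in a small helper. This is only a sketch under some assumptions: the names join_on_keys and demo, the sample rows, and the columns left_val and right_val are made up for illustration and do not come from the question. The key point is that bracket indexing ties each column to its own DataFrame, which should avoid the ambiguity described in the question for f.col(...), where a bare column name carries no information about which side it belongs to.

```python
def join_on_keys(left_df, right_df, left_key, right_key, how="left"):
    """Join two DataFrames on differently named key columns chosen at run time.

    left_key and right_key are plain strings, e.g. read from configuration.
    """
    # left_df[left_key] is bound to left_df and right_df[right_key] to right_df,
    # so the join condition states explicitly which side each column comes from.
    condition = left_df[left_key] == right_df[right_key]
    return left_df.join(right_df, condition, how=how)


def demo():
    """Hypothetical usage; needs a working local Spark installation."""
    # Imported here so the helper above has no import-time dependency on Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("join-demo").getOrCreate()
    )

    # Made-up sample data, for illustration only.
    ta = spark.createDataFrame([(1, "a"), (2, "b")], ["leftKey", "left_val"])
    tb = spark.createDataFrame([(1, "x")], ["rightKey", "right_val"])

    # In practice these two names would be derived from configuration variables.
    left_key = "leftKey"
    right_key = "rightKey"

    final = join_on_keys(ta, tb, left_key, right_key, how="left")
    final.show()

    spark.stop()
```

Calling demo() runs the join end to end on a local Spark session. Because the two key columns have different names, both leftKey and rightKey remain in the result; drop one of them afterwards if it is not needed.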