van*_*ser 2 pyspark palantir-foundry foundry-code-repositories foundry-python-transform
I noticed that my Code Repository warns me that using withColumn inside a for/while loop is an anti-pattern. Why is this discouraged? Isn't it normal use of the PySpark API?
We have noticed in practice that using withColumn inside for/while loops results in poor query plan performance, as described here. This isn't obvious when you first write code in Foundry, so we built a feature that warns you about this behavior.
We would recommend you follow the advice of the Scala docs instead:
withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Since
2.0.0
Note
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
i.e. building all of the new columns in a single select:
my_other_columns = [...]
df = df.select(
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns],
)
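This single select is far superior to the loop that the warning targets. A minimal sketch of that anti-pattern follows; the column names, the "_suffix" rename, df, and F mirror the select example above and are illustrative rather than taken from the original post:

my_other_columns = [...]
for col_name in my_other_columns:
    # Each iteration calls withColumn once, adding another projection on top of the plan.
    df = df.withColumn(col_name + "_suffix", F.col(col_name))

Every pass through the loop stacks one more projection onto the logical plan, which is exactly the plan growth the Scala docs warn about.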
While this may technically be normal use of the PySpark API, it leads to poor query plan performance once the number of withColumn calls in your job grows large, so we would rather you avoid the problem entirely.
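If you want to see the difference yourself, one option (a sketch only; the session setup, column count, and literal values are made up for illustration) is to compare the two plans with DataFrame.explain():

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # DataFrame with a single "id" column

# Loop of withColumn calls: the parsed plan nests one Project per added column.
looped = df
for i in range(50):
    looped = looped.withColumn("c" + str(i), F.lit(i))
looped.explain(extended=True)

# Single select: all 50 columns are added in one Project.
selected = df.select("id", *[F.lit(i).alias("c" + str(i)) for i in range(50)])
selected.explain(extended=True)

The parsed and analyzed plans printed for the loop version grow with the number of withColumn calls, while the select version stays flat no matter how many columns are added.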