flatMapGroupsWithState 中 OutputMode 的目的是什么？如何/在哪里使用它？

Question

flatMapGroupsWithState 中 OutputMode 的目的是什么？如何/在哪里使用它？

Jac*_*ski 6 apache-spark spark-structured-streaming

我正在探索KeyValueGroupedDataset.flatMapGroupsWithStateSpark Structured Streaming 中的任意状态聚合。

KeyValueGroupedDataset.flatMapGroupsWithState操作员签名如下：

flatMapGroupsWithState[S: Encoder, U: Encoder](
  outputMode: OutputMode,
  timeoutConf: GroupStateTimeout)(
  func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]

Run Code Online (Sandbox Code Playgroud)

OutputMode辩论的目的是什么？

在查看源代码（作为底层物理运算符的FlatMapGroupsWithStateExec 的）时，我找不到任何OutputMode可以使用的地方。

Answer 1

Bar*_*zny 5

确实，我也没有发现任何用途。我对此有几个理论：

这里的模式是为了与逻辑运算符的签名保持一致org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsWithState。如果您检查org.apache.spark.sql.execution.SparkStrategies.BasicOperatorsapply 方法，您会注意到逻辑运算符经常将其所有参数传递给物理运算符。我不确定，但这看起来像是设计指南，但这只是我的假设。
这也可能是一个遗留原因。FlatMapGroupsWithState从MapGroupsWithState为了强制输出模式语义演变而来。它在此 PR 中实现：https://github.com/apache/spark/pull/17197/files ( SPARK-19858 )，并重MapGroupsWithState命名为FlatMapGroupsWithState并outputMode添加为参数。也许 - 如果我之前的理论是错误的 - 它在这里只是因为它通过了 PR 并且没有人愿意抱怨它，因为“它已经在这里”原则？
也许将来outputMode会被传递给映射函数？我发现用于保存流聚合的节点 ( StateStoreSaveExec) 使用输出模式来找出要保存在状态存储中的条目。也许这将是一个即将添加的用于*withState转型的新功能，正如评论中顺便说的那样：
- @param outputMode 的输出模式func

归档时间：	6 年，3 月前
查看次数：	693 次
最近记录：	6 年，2 月前