Apache Beam performance on GCP Dataflow: Python vs. Java

Question

Jie*_*ang 5 python java sliding-window google-cloud-platform apache-beam

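We run Beam data pipelines on GCP Dataflow, written in both Python and Java. At first we had a few simple, straightforward Python Beam jobs, and they worked very well, so we recently decided to convert more of our Java Beam jobs to Python. With more complex jobs, especially those that need windowing in Beam, we noticed that the Python jobs are noticeably slower than the Java jobs, use more CPU and memory, and cost more.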

Some sample Python code looks like this:

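    import apache_beam as beam

    step1 = (
        read_from_pub_sub
        # Key each message by an ID taken from its payload.
        | "MapKey" >> beam.Map(lambda elem: (elem.data[key], elem))
        | "WindowResults"
        >> beam.WindowInto(
            # 360-second windows, with a new window starting every 90 seconds.
            beam.window.SlidingWindows(360, 90),
            allowed_lateness=args.allowed_lateness,
        )
        | "GroupById" >> beam.GroupByKey()
    )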


The Java code is as follows:

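    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    // GroupByKey yields one KV<key, Iterable<value>> per key and window.
    PCollection<KV<String, Iterable<DataStructure>>> step1 =
        message
            .apply(
                "MapKey",
                MapElements.into(
                        TypeDescriptors.kvs(
                            TypeDescriptors.strings(), TypeDescriptor.of(DataStructure.class)))
                    .via(event -> KV.of(event.key, event)))
            .apply(
                "WindowResults",
                Window.<KV<String, DataStructure>>into(
                        SlidingWindows.of(Duration.standardSeconds(360))
                            .every(Duration.standardSeconds(90)))
                    .withAllowedLateness(Duration.standardSeconds(this.allowedLateness))
                    .discardingFiredPanes())
            .apply("GroupById", GroupByKey.<String, DataStructure>create());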


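We noticed that Python consistently uses about 3 times as much CPU and memory as Java. We also ran some experimental tests that only read JSON input and wrote JSON output, with the same result. We are not sure whether this is simply because Python is generally slower than Java, or because GCP Dataflow executes Beam Python and Beam Java differently. Any similar experiences, tests, or explanations would be appreciated.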
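For readers less familiar with sliding windows: both snippets use 360-second windows that start every 90 seconds, so each element belongs to 360 / 90 = 4 windows and is grouped once per window in either SDK. The following is a minimal plain-Python sketch of that assignment arithmetic. The function name `sliding_window_starts` is hypothetical, timestamps are assumed to be integer seconds, and this is an illustration rather than Beam's own implementation.

```python
def sliding_window_starts(timestamp, size=360, period=90):
    """Return the start times of all sliding windows that contain `timestamp`.

    A window [start, start + size) contains the timestamp when
    start <= timestamp < start + size; window starts are multiples of `period`.
    """
    # Latest window start that is at or before the timestamp.
    last_start = timestamp - (timestamp % period)
    # Step back one period at a time while the window still covers the timestamp.
    return list(range(last_start, timestamp - size, -period))


starts = sliding_window_starts(1000)
print(starts)       # window start times, newest first
print(len(starts))  # number of windows this one element lands in
```

Any per-element overhead in the SDK is therefore paid several times per input element once the windowing step is added, which is one reason a windowed job can make an existing speed gap more visible.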

Answer 1

Ken*_*les 6

Yes, this is a very normal performance factor between Python and Java. In fact, for many programs the factor can be 10x or more.

The details of a program can radically change the relative performance. Here are some things to consider:

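Profiling Dataflow jobs (official documentation)
Profiling Dataflow pipelines (Medium blog post)
Profiling Apache Beam Python pipelines (another Medium blog post)
Profiling Python (general Cloud Profiler documentation)
How to profile a Python Dataflow job? (an earlier StackOverflow question about profiling Python jobs)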
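Before turning to the Dataflow-side profilers listed above, it can help to profile the per-element logic locally. The sketch below uses only the standard-library `cProfile` and `pstats` modules. The function `process_element` and its JSON-in/JSON-out body are hypothetical stand-ins for whatever a pipeline's `Map` or `DoFn` does, not code from the question.

```python
import cProfile
import io
import json
import pstats


def process_element(raw):
    """Hypothetical per-element work: parse JSON, change a field, serialize."""
    record = json.loads(raw)
    record["value"] = record["value"] * 2
    return json.dumps(record)


def profile_locally(elements, top=5):
    """Run process_element over `elements` under cProfile; return results and a report."""
    profiler = cProfile.Profile()
    profiler.enable()
    results = [process_element(e) for e in elements]
    profiler.disable()

    # Render the `top` most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return results, buffer.getvalue()


sample = [json.dumps({"id": i, "value": i}) for i in range(10_000)]
results, report = profile_locally(sample)
print(report)
```

A local profile like this only covers user code. It does not capture serialization between workers, shuffle, or other runner overhead, which is what the Dataflow profiling tools are for.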

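If you prefer Python for its concise syntax or its library ecosystem, the way to get speed is to do the core processing in optimized C libraries or Cython, for example with pandas/numpy. If you use Beam's new pandas-compatible DataFrame API, you get this benefit automatically.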
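The principle can be shown without third-party dependencies. In the sketch below, the C-implemented built-ins `sum` and `map` stand in for numpy (a deliberate substitution, so the example runs on a bare Python install); numpy applies the same idea to whole arrays. The function names and the sum-of-squares workload are hypothetical.

```python
import operator
import timeit


def total_pure_python(values):
    """Sum of squares with an explicit interpreter-level loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def total_c_builtins(values):
    """Same result, with the iteration driven by the C-implemented sum and map."""
    return sum(map(operator.mul, values, values))


values = list(range(100_000))

# Timings vary by machine and Python version, so none are asserted here.
loop_time = timeit.timeit(lambda: total_pure_python(values), number=10)
builtin_time = timeit.timeit(lambda: total_c_builtins(values), number=10)
print(f"explicit loop: {loop_time:.3f}s, C built-ins: {builtin_time:.3f}s")
```

In a Beam pipeline the same effect usually requires handing the library a batch of elements at once, for example the grouped values for one key, rather than calling it once per element.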
