使用 --experiments=upload_graph 使 Dataflowrunner 工作

Question

使用 --experiments=upload_graph 使 Dataflowrunner 工作

kry*_*anh 5 google-cloud-dataflow apache-beam

我有一个管道，它生成的数据流图（序列化 JSON 表示）超出了 API 的允许限制，因此无法像通常那样通过 apache beam 的数据流运行器启动。并且使用指示参数运行数据流运行--experiments=upload_graph器不起作用，并且失败并说没有指定步骤。

通过错误收到有关此尺寸问题的通知时，会提供以下信息：

the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API. 

Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your 
GCS staging bucket instead of embedding in the API request.

Run Code Online (Sandbox Code Playgroud)

现在使用此参数，确实会导致数据流运行器将附加dataflow_graph.pb文件上传到通常的 pipeline.pb 文件旁边的暂存位置。我证实它确实存在于 gcp 存储中。

但是，gcp 数据流中的作业在启动后立即失败，并出现以下错误：

Runnable workflow has no steps specified.

Run Code Online (Sandbox Code Playgroud)

我已经用各种管道尝试过这个标志，甚至是 apache 梁示例管道，并看到了相同的行为。

这可以通过使用字数示例来重现：

mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.11.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

Run Code Online (Sandbox Code Playgroud)

cd word-count-beam/

Run Code Online (Sandbox Code Playgroud)

在没有experiments=upload_graph参数的情况下运行它：（如果要运行它，请确保指定您的项目和存储桶）

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner

Run Code Online (Sandbox Code Playgroud)

运行它，experiments=upload_graph结果管道失败并显示消息workflow has no steps specified

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --experiments=upload_graph \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner

Run Code Online (Sandbox Code Playgroud)

现在我希望数据流运行器会指示 gcp 数据流从源代码中指定的存储桶中读取步骤：

https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881

然而，情况似乎并非如此。有没有人让它起作用，或者找到了一些关于这个功能的文档，可以为我指明正确的方向？

Answer 1

小智 2

该实验已被恢复，消息传递将在 Beam 2.13.0 中得到纠正

恢复公关

归档时间：	7 年，1 月前
查看次数：	1331 次
最近记录：	5 年，10 月前