我的团队开发的数据流管道突然开始卡住,停止处理事件。他们的工作日志充满警告消息,指出一个特定步骤被卡住了。奇怪的是,失败的步骤是不同的,一个是BigQuery输出,另一个是Cloud Storage输出。
以下是我们收到的日志消息:
对于BigQuery输出:
Processing stuck in step <STEP_NAME>/StreamingInserts/StreamingWriteTables/StreamingWrite for at least <TIME> without outputting or completing in state finish
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:765)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:829)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:131)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
Run Code Online (Sandbox Code Playgroud)
对于云存储输出:
Processing stuck in step <STEP_NAME>/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least <TIME> without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:421)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1007)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:726)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Run Code Online (Sandbox Code Playgroud)
所有应用程序都已被排干并重新部署,但是一段时间(3到4个小时)后发生了相同的事情。其中一些已经运行了40多个天,而他们突然进入了代码库,而没有对代码进行任何更改。
我想寻求帮助以了解此问题的原因。这些是存在这些问题的某些Dataflow作业的以下ID:
卡在BigQuery输出中: …