I exported a BigQuery dataset of about 650 GB to Avro files on GCS and ran a Dataflow program to process those Avro files. However, I hit an OutOfMemoryError even when processing only a single Avro file of about 1.31 GB.
I get the following error message; the exception appears to come from AvroIO and the Avro library:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:260)
at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.AvroReader$AvroFileIterator.next(AvroReader.java:143)
at com.google.cloud.dataflow.sdk.runners.worker.AvroReader$AvroFileIterator.next(AvroReader.java:113)
at com.google.cloud.dataflow.sdk.util.ReaderUtils.readElemsFromReader(ReaderUtils.java:37)
at com.google.cloud.dataflow.sdk.io.AvroIO.evaluateReadHelper(AvroIO.java:638)
at com.google.cloud.dataflow.sdk.io.AvroIO.access$000(AvroIO.java:118)
at com.google.cloud.dataflow.sdk.io.AvroIO$Read$Bound$1.evaluate(AvroIO.java:294)
at com.google.cloud.dataflow.sdk.io.AvroIO$Read$Bound$1.evaluate(AvroIO.java:290)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:611)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:584)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:328)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:145)
at com.htc.studio.bdi.dataflow.ActTranGenerator.main(ActTranGenerator.java:224)
Any suggestions about this exception?
Thanks!
You are using DirectPipelineRunner, which runs on your local machine. This mode executes entirely in memory and is best suited for testing or developing against small datasets. Direct pipeline execution may need to keep several copies of your data in memory (depending on your exact algorithm), so I would not recommend it for large files. Instead, specify --runner=BlockingDataflowPipelineRunner to run the job through the Dataflow service.
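For example, a minimal sketch of selecting that runner programmatically with the pre-Beam Dataflow SDK; the class name, project id, and staging bucket below are placeholders, not values from the original question:

// Minimal sketch, assuming the pre-Beam Dataflow SDK (com.google.cloud.dataflow.sdk).
// Project id and staging location are placeholders.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;

public class RunOnDataflowService {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class); // run on the Dataflow service, not in local memory
    options.setProject("my-gcp-project");                    // placeholder project id
    options.setStagingLocation("gs://my-bucket/staging");    // placeholder staging bucket

    Pipeline p = Pipeline.create(options);
    // ... add AvroIO.Read and the rest of the pipeline here, then:
    p.run();
  }
}

The same choice can also be made without code changes by passing --runner=BlockingDataflowPipelineRunner on the command line, as suggested above.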
This is not directly relevant to your case, but it may help others who hit OOMs when using DataflowPipelineRunner or BlockingDataflowPipelineRunner:
OutOfMemory exceptions can be hard to diagnose because: (1) the place where memory is exhausted is not necessarily the place that is consuming most of it, and (2) because of the way Dataflow optimizes the pipeline, ParDos from different logical parts of the pipeline may execute together in the same JVM. You may therefore have to look through the worker logs for co-located DoFns to determine which DoFn is actually using all the memory.
One common cause of OOMs is a DoFn processing a KV<K, Iterable<V>> that tries to hold all of the Vs in memory (for example, in a Collection). This does not scale when a single key can have many values. A sketch of this anti-pattern and a more scalable alternative follows.
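The following sketch is illustrative only; the DoFn names and the counting logic are hypothetical and not from the original pipeline, and it assumes the pre-Beam Dataflow SDK DoFn API:

// Hedged sketch of the anti-pattern and an alternative (pre-Beam Dataflow SDK).
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;
import java.util.ArrayList;
import java.util.List;

// Anti-pattern: buffers every value for a key in memory before producing output.
class BufferAllValuesFn extends DoFn<KV<String, Iterable<String>>, KV<String, Long>> {
  @Override
  public void processElement(ProcessContext c) {
    List<String> all = new ArrayList<>();
    for (String v : c.element().getValue()) {
      all.add(v); // can OOM when one key has millions of values
    }
    c.output(KV.of(c.element().getKey(), (long) all.size()));
  }
}

// Better: iterate the values lazily and keep only constant per-element state (here, a counter).
class StreamValuesFn extends DoFn<KV<String, Iterable<String>>, KV<String, Long>> {
  @Override
  public void processElement(ProcessContext c) {
    long count = 0;
    for (String v : c.element().getValue()) {
      count++; // constant memory regardless of how many values the key has
    }
    c.output(KV.of(c.element().getKey(), count));
  }
}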
If there is no algorithmic problem and you simply need workers with more memory, you can adjust the VM instance type with: --workerMachineType=n1-standard-4
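As a sketch (again assuming the pre-Beam Dataflow SDK), the same setting can be applied programmatically on the worker pool options; the class name below is a placeholder:

// Sketch: equivalent of passing --workerMachineType=n1-standard-4 on the command line.
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class MachineTypeExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.as(DataflowPipelineWorkerPoolOptions.class)
           .setWorkerMachineType("n1-standard-4"); // larger workers give each JVM more memory
  }
}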