Dataflow GZIP TextIO ZipException: too many length or distance symbols

Fem*_*ich 5 java gzipinputstream google-cloud-dataflow

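When using the TextIO.Read transform with a large number of compressed text files (more than 1000 files, each between 100 MB and 1.5 GB), we sometimes get the following error: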

java.util.zip.ZipException: too many length or distance symbols
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Searching online for the same ZipException only turns up this reply:

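Zip file errors often happen when the hot deployer attempts to deploy an application before it has been fully copied to the deploy directory. This is fairly common if copying the file takes several seconds. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.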

Has anyone else run into a similar exception? Or is there any way we can work around this problem?

Iva*_*sov 6

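Looking at the code that produces the error message, this appears to be a problem with the zlib library (which the JDK uses) not supporting the format of the gzip files you have.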

It looks like the following zlib bug: codes for reserved symbols are rejected even if unused.

Unfortunately, there is probably little we can do beyond suggesting that you produce these compressed files with a different utility.

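If you can produce a small example gzip file that we could use to reproduce the issue, we might be able to see whether it can be worked around somehow, but I would not count on that succeeding.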
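
To find which input files trigger the exception (and so narrow down a small reproducing sample), one option is to decompress each file locally with the same `java.util.zip.GZIPInputStream` class that appears in the stack trace. The following is only a minimal sketch: the class name `GzipCheck` and its methods `firstError` and `gzip` are hypothetical helpers invented for this illustration, not part of the Dataflow SDK, and a clean local run does not guarantee the Dataflow worker behaves identically.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not part of the Dataflow SDK): fully decompresses a
// gzip stream and reports the first error, if any.
final class GzipCheck {

    /** Returns null if the stream decompresses cleanly, otherwise a description of the error. */
    static String firstError(InputStream raw) {
        byte[] buf = new byte[64 * 1024];
        try (InputStream r = raw; GZIPInputStream in = new GZIPInputStream(r)) {
            // Read and discard all decompressed bytes; we only care about failures.
            while (in.read(buf) != -1) {
                // keep reading
            }
            return null;
        } catch (IOException e) {
            // ZipException and EOFException are both IOExceptions.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    /** Compresses a string with the JDK's own gzip writer (used here only for self-checks). */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Checks every local file path given on the command line. */
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            String err = firstError(Files.newInputStream(Paths.get(path)));
            System.out.println(path + ": " + (err == null ? "OK" : err));
        }
    }
}
```

Running it over local copies of the inputs (for example `java GzipCheck part-0001.gz part-0002.gz`, where the file names are placeholders) prints one line per file. Any file that fails here with the same `ZipException` is a candidate for the small reproducing sample requested above.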