Amr*_*Jha 5 apache-spark parquet apache-spark-sql
You may hit the following error when reading a Parquet file in Spark:
App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44, 10.23.5.196, executor 2): java.io.EOFException: Reached the end of stream with 193212 bytes left to read
App >     at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
App >     at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
App >     at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
App >     at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
App >     at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
App >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
App >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
App >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
App >     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
App >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
App >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)
The error is produced by the following Spark command:
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
小智 9
For me, the approach above didn't solve the problem, but the following did:
--conf spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
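This flag is an S3A connector setting (fs.s3a.experimental.input.fadvise) that Spark forwards to Hadoop via the spark.hadoop. prefix; "sequential" favours streaming whole-object reads over random access. As a minimal sketch (Scala, assuming the hadoop-aws / S3A connector is on the classpath), the same policy can be set when building the session instead of on the spark-submit command line. The bucket path below is purely illustrative, since the original question truncates it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-s3a-sequential-read")
  // S3A read policy: prefer sequential (streaming) reads over random access.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
  .getOrCreate()

// Hypothetical path, for illustration only.
val df = spark.read.parquet("s3a://your-bucket/path/file.parquet")
df.show(5, false)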
小智 2
I think you can work around the problem with:
--conf spark.sql.parquet.enableVectorizedReader=false
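This matches the VectorizedParquetRecordReader frames in the stack trace above: disabling the vectorized reader makes Spark fall back to the row-by-row Parquet reader. Because spark.sql.parquet.enableVectorizedReader is a regular Spark SQL conf, it can also be toggled on an already-running session, which is handy for checking whether the vectorized reader is really the culprit. A minimal sketch (Scala; spark is an existing SparkSession and the path is illustrative):

// Disable the vectorized Parquet reader for subsequent reads in this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Reads now use the non-vectorized record reader (typically slower, but it
// avoids the code path shown in the stack trace above).
val df = spark.read.parquet("s3a://your-bucket/path/file.parquet")
df.show(5, false)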