IllegalStateException with mongo-hadoop and Spark


While trying to use MongoDB as an input RDD, I am getting "java.lang.IllegalStateException: not ready" from org.bson.BasicBSONDecoder._decode.

Similar to: Exception while connecting to mongodb in spark

The tasks finish completely, with seemingly correct results, but every run produces the following:

java.lang.IllegalStateException: not ready
    at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:139)
    at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:123)
    at com.mongodb.hadoop.input.MongoInputSplit.readFields(MongoInputSplit.java:185)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
    at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
    at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:140)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Loss was due to org.bson.BSONException
org.bson.BSONException: should be impossible
    at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:126)
    at com.mongodb.hadoop.input.MongoInputSplit.readFields(MongoInputSplit.java:185)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
    at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:140)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I would appreciate any input. Every Spark function block is wrapped in a try/catch that returns a default value, so I am fairly sure I am not letting any exceptions propagate upward.
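The guard looks roughly like this in every transformation (a minimal sketch; input and process stand in for my actual RDDs and per-record logic):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Sketch of the defensive pattern: any exception thrown inside the
// function is caught and a default value is returned, so nothing
// should propagate up to the Spark scheduler.
JavaRDD<String> guarded = input.map(new Function<String, String>() {
    @Override
    public String call(String record) {
        try {
            return process(record); // hypothetical per-record logic
        } catch (Exception e) {
            return "";              // default return value on failure
        }
    }
});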

The only places I can think of that could trigger this are:

// the path argument is a dummy; MongoOutputFormat writes to the collection in mongo.output.uri
file_save.saveAsNewAPIHadoopFile("file:///bogus", Object.class, Object.class, MongoOutputFormat.class, config);

// reads input splits from the collection configured via mongo.input.uri
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
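For completeness, the surrounding setup is roughly as follows (a sketch; the URIs and collection names here are placeholders, not my real ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

Configuration config = new Configuration();
// mongo-hadoop computes input splits from this collection ...
config.set("mongo.input.uri", "mongodb://localhost:27017/db.input");
// ... and MongoOutputFormat writes results to this one
config.set("mongo.output.uri", "mongodb://localhost:27017/db.output");

JavaSparkContext sc = new JavaSparkContext("local", "mongo-test");
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(
        config, com.mongodb.hadoop.MongoInputFormat.class,
        Object.class, BSONObject.class);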

I am happy to provide more information.