I'm not sure whether this is a bug, but if you do something like this:
// d:spark.RDD[String]
d.distinct().map(x => d.filter(_.equals(x)))
you get a Java NullPointerException (NPE). But if you call collect right after distinct, everything works fine.
I'm using Spark 0.6.1.
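A likely explanation, offered as an assumption rather than something confirmed in the question: the closure passed to map runs on the executors, and an RDD such as d referenced inside it is not usable there, since RDD transformations can only be invoked from the driver. The sketch below follows the workaround described above (collect right after distinct), so that the outer map runs over a local array on the driver. It uses the current org.apache.spark package names rather than the spark package of 0.6.1, and the names DistinctFilterSketch and filterPerDistinct are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctFilterSketch {
  // Hypothetical helper: builds one filtered RDD per distinct value of d.
  // distinct().collect() brings the distinct values to the driver, so the
  // map below is a plain Scala collection map, not an RDD transformation.
  def filterPerDistinct(d: RDD[String]): Map[String, RDD[String]] = {
    val keys: Array[String] = d.distinct().collect()
    keys.map(x => x -> d.filter(_ == x)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made only so the sketch can run standalone.
    val conf = new SparkConf().setAppName("distinct-filter-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val d: RDD[String] = sc.parallelize(Seq("a", "b", "a", "c"))
      // Each value is an ordinary RDD that can be used from the driver.
      filterPerDistinct(d).foreach { case (key, rdd) =>
        println(s"$key -> ${rdd.count()}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

This only makes sense when the number of distinct values is small enough to collect to the driver; for a large key space, a keyed operation such as groupBy on the original RDD would be the usual alternative.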
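We are trying to submit a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we are getting a rather cryptic NPE on EMR. Everything runs fine as a plain Scala program, so we are not sure what is causing the problem. Here is the stack trace: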
18:02:55,271 ERROR Utils:91 - Aborting task
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply …