I want to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do it like this:
df.coalesce(1).write.partitionBy("entity", "year", "month", "day", "status")
.mode(SaveMode.Append).parquet(s"$location")
I have tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset, so all of the partitioning, compression, and saving of the files has to be done by a single CPU core.
I could rewrite this to do the partitioning manually (for example, by filtering on the distinct partition values) before calling coalesce; a sketch of that idea follows below.
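A minimal sketch of that manual approach, assuming the partition columns take a small, known set of values; writeOnePartition is a hypothetical helper introduced here for illustration, not part of the original code:

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Write one entity's slice of the data; coalesce(1) means one task,
// and hence one file, per output directory within that slice.
def writeOnePartition(df: DataFrame, entity: String, location: String): Unit = {
  df.filter(col("entity") === entity)
    .coalesce(1)
    .write
    .partitionBy("year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(location)
}

// Collect the distinct entity values on the driver, then write each
// slice as its own job. The loop here is sequential; the slices could
// also be submitted concurrently (e.g. via Scala parallel collections).
val entities = df.select("entity").distinct().collect().map(_.getString(0))
entities.foreach(e => writeOnePartition(df, e, location))

From Spark 1.6 onwards there is also a single-pass alternative: df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status")) followed by the same partitionBy write. The shuffle routes each partition-value combination to exactly one task, so each output directory still receives a single file while the work stays spread across cores.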
But is there a better way to do this using the standard Spark SQL API?
I am getting the following exception in a long-running Spark Streaming application. The exception can occur after a few minutes, but it may also take days to appear. The input data is very consistent.
I have seen this Jira ticket, but I don't think it is the same problem: that one is a java.lang.IllegalArgumentException, while this is java.io.IOException: Class not found.
My application is streaming data and writing it to Parquet using Spark SQL, roughly along the lines of the sketch below.
I am using Spark 1.5.2. Any ideas?
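For context, a minimal sketch of the kind of application described, using Spark 1.5-era APIs. The socket source, the schema, and the output path are assumptions for illustration; the original question gives none of these details.

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingParquetSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-parquet-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.implicits._

    // Hypothetical input source; the question does not say where the
    // data actually comes from.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The DStream.map here matches the stack trace: on every batch,
    // MappedDStream.compute calls RDD.map, which passes the closure
    // through ClosureCleaner on the driver -- the point where the
    // IOException below is thrown.
    lines.map(_.split(','))
      .foreachRDD { rdd =>
        rdd.map(fields => (fields(0), fields(1)))
          .toDF("key", "value")
          .write
          .mode(SaveMode.Append)
          .parquet("/tmp/streaming-output") // assumed output path
      }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the failure in the trace happens on the driver during job generation, while ClosureCleaner reads the bytecode of such a map closure, not inside the Parquet write itself.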
28-01-2016 09:36:00 ERROR JobScheduler:96 - Error generating jobs for time 1453973760000 ms
java.io.IOException: Class not found
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.a(Unknown Source)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:40)
at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:81)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:187)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
at org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
at scala.Option.map(Option.scala:145)
at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
…