Spark crashes while reading a JSON file when linked with aws-java-sdk

Bor*_*ris 8 jackson apache-spark aws-java-sdk

Let config.json be a small JSON file:

{
    "toto": 1
}

I wrote a simple program that reads the JSON file with sc.textFile (the file can be on S3, local, or HDFS, so textFile is convenient):

import org.apache.spark.{SparkContext, SparkConf}

object testAwsSdk {
  def main( args:Array[String] ):Unit = {
    val sparkConf = new SparkConf().setAppName("test-aws-sdk").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val json = sc.textFile("config.json") 
    println(json.collect().mkString("\n"))
  }
}

The SBT file pulls in only spark-core:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

The program works as expected and writes the content of config.json to standard output.

Now I also want to link with aws-java-sdk, Amazon's SDK for accessing S3:

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

Running the same code, Spark throws the following exception:

Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
 at [Source: {"id":"0","name":"textFile"}; line: 1, column: 1]
    at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
    at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
    at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578)
    at org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:82)
    at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
    at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:133)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1012)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:827)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:825)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
    at testAwsSdk$.main(testAwsSdk.scala:11)
    at testAwsSdk.main(testAwsSdk.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

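Reading the stack trace, it seems that when aws-java-sdk is linked, sc.textFile detects that the file is a JSON file and tries to parse it with Jackson, assuming a certain format that it of course cannot find. I need to link with aws-java-sdk, so my questions are:

1- Why does adding aws-java-sdk change the behavior of spark-core?

2- Is there a workaround (the file can be on HDFS, S3, or local)?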

Bor*_*ris 10

I talked to Amazon support. It is a dependency issue with the Jackson library. In SBT, override Jackson:

libraryDependencies ++= Seq( 
"com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
"org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
) 

dependencyOverrides ++= Set( 
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4" 
) 
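
One way to confirm that the override took effect is to print where a Jackson class was actually loaded from at runtime. The following is a minimal sketch, not part of the original answer: `ClasspathProbe` and `locationOf` are hypothetical names introduced here, and the sketch assumes the program is run with the same classpath as the Spark job (for example via `sbt run`).

```scala
object ClasspathProbe {
  /** Returns the jar or directory a class was loaded from, if it can be determined. */
  def locationOf(className: String): Option[String] =
    try {
      val cls = Class.forName(className)
      // Classes loaded by the JDK bootstrap loader have no code source, hence the Option wrappers.
      Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
    } catch {
      case _: ClassNotFoundException => None
    }

  def main(args: Array[String]): Unit = {
    // ObjectMapper is the class that appears in the stack trace of the question.
    val name = "com.fasterxml.jackson.databind.ObjectMapper"
    println(locationOf(name).getOrElse(s"$name: not found or no code source"))
  }
}
```

When the class comes from a dependency jar, the jar file name in the printed location usually includes the version, which shows whether the 2.4.x or the 2.5.x jackson-databind ended up on the classpath.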

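Their answer: We reproduced this on a Mac, an EC2 instance (Red Hat AMI), and EMR (Amazon Linux), three different environments. The root cause is that sbt builds a dependency graph and then resolves version conflicts by evicting the older version of a dependency and picking the latest one. In this case, Spark depends on version 2.4 of the Jackson library, while the AWS SDK needs 2.5. So there is a version conflict, and sbt evicts the version Spark depends on (the older one) and picks the AWS SDK's version (the latest one).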
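
Since the eviction described above can affect more than one Jackson artifact, a broader variant of the override could pin the core Jackson modules to the same release. This is only a sketch under assumptions: the jackson-databind line comes from the answer, while the jackson-core and jackson-annotations lines are additions that assume those artifacts were bumped by the same conflict. Whether the AWS SDK behaves correctly against the older Jackson should be tested for the calls you actually use.

```scala
// build.sbt (sbt 0.13.x syntax, matching the answer above)
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

// Assumption: keep the core Jackson artifacts on one consistent 2.4.4 release.
// Only the jackson-databind entry is taken from the original answer.
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.4.4",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.4.4"
)
```

Running the `evicted` task (`sbt evicted`) lists which versions sbt dropped during conflict resolution, which helps decide which overrides are really needed. Note that in sbt 1.x, `dependencyOverrides` is a `Seq` rather than a `Set`.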