Gis*_*gen 8 · Tags: hadoop, amazon-s3, amazon-web-services, apache-spark
I have a Spark cluster on DC/OS, and I'm running a Spark job that reads from S3. The versions are as follows:

I read the data in like this:
```scala
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", Config.awsEndpoint)
hadoopConf.set("fs.s3a.access.key", Config.awsAccessKey)
hadoopConf.set("fs.s3a.secret.key", Config.awsSecretKey)
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val data = sparkSession.read.parquet("s3a://" + "path/to/file")
```

The error I get is:
```
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:215)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:138)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:170)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:321)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:809)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:182)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:207)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
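An `IllegalAccessError` on `MutableCounterLong` is a common symptom of a version mismatch: the `hadoop-aws` JAR in the assembly was built against a different Hadoop line than the `hadoop-common` classes the cluster's Spark distribution loads, and the constructor's visibility differs between those lines. A minimal `build.sbt` sketch of the usual fix, assuming the cluster ships Spark with Hadoop 2.7.x (the version numbers below are placeholders, not taken from this question; check the cluster's `jars/` directory or `spark-submit --version` first):

```scala
// Sketch: pin hadoop-aws to the SAME version as the Hadoop bundled with the
// cluster's Spark distribution. "2.7.3" is a placeholder -- verify it against
// the actual cluster.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
// The hadoop-aws 2.7.x line was built against aws-java-sdk 1.7.4.
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
```

This keeps `S3AInstrumentation` and `MutableCounterLong` coming from the same Hadoop release, so the constructor call in the stack trace resolves against the class it was compiled against.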
This job fails only when I submit it to the cluster as a JAR. If I run the code locally or in a Docker container, it does not fail and reads the data without any problem.

I would be grateful if anyone could help me figure this out!
小智 0
I also faced a problem (not exactly the same exception) running a Docker image on a Spark cluster (Kubernetes) that ran perfectly locally. I then changed the assembly setup and the Hadoop versions in build.sbt:
```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.9"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.271"
dependencyOverrides += "org.apache.hadoop" % "hadoop-hdfs" % "3.1.1"
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "3.1.1"

assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
  case "log4j.properties" => MergeStrategy.discard
  case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
  case PathList("META-INF", "services", "org.apache.hadoop.fs.s3a.S3AFileSystem") => MergeStrategy.filterDistinctLines
  case "reference.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}
```
But I am not sure whether this will work for you, because the same code does not work on an AWS EKS machine and throws the same exception when the Hadoop version is 2.8.1. The Hadoop and AWS versions are the same there, and it works fine locally, so I am trying to contact the AWS team for help.
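A complementary route to version-pinning, if the intent is to rely on the cluster's own Spark and Hadoop rather than ship them in the fat JAR: mark the Spark artifacts as `provided` so the assembly does not bundle Spark's transitive Hadoop classes at all. A sketch, assuming the sbt-assembly plugin used above (the version strings mirror the answer's build.sbt and are not otherwise verified):

```scala
// Sketch: Spark already exists on the cluster, so exclude it (and the Hadoop
// classes it drags in transitively) from the assembly JAR. The cluster's own
// classpath then supplies a single, consistent Hadoop.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
```

With `provided` scope, `sbt assembly` still compiles against Spark but leaves it out of the JAR, which removes one whole class of "works locally, fails on the cluster" classpath conflicts.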
Viewed 3,269 times.