How to use gcs-connector and google-cloud-storage together in Scala

And*_*ndy 3 scala google-cloud-storage apache-spark

I am trying to list all the objects in a bucket and then read some or all of them as CSV. I have spent two days on this now, trying to do both, but if I use Google's libraries I can only get one of the two working at a time.

I think the problem is an incompatibility between Google's own libraries, but I am not sure. First, let me show how I do each thing.

This is how I read a single file. In my Scala setup, you can pass a gs:// URL straight to spark.read.csv (spark below is the SparkSession):

val jsonKeyFile = "my-local-keyfile.json"
spark.sparkContext.hadoopConfiguration
  .set("google.cloud.auth.service.account.json.keyfile", jsonKeyFile)

// gcsFile is any gs:// URL, e.g. "gs://my-bucket/some-file.csv"
spark.read
  .option("header", "true")
  .option("sep", ",")
  .option("inferSchema", "false")
  .option("mode", "FAILFAST")
  .csv(gcsFile)

This works fine on its own, and I get a valid DataFrame out of it. The problem starts when I add Google's storage library:

libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "1.70.0"

If I then try to run the same code again, I get this bad boy from the .csv call:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/14 16:38:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

An exception or error caused a run to abort: Class com.google.common.base.Suppliers$SupplierOfInstance does not implement the requested interface java.util.function.Supplier 
java.lang.IncompatibleClassChangeError: Class com.google.common.base.Suppliers$SupplierOfInstance does not implement the requested interface java.util.function.Supplier
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getGcsFs(GoogleHadoopFileSystemBase.java:1488)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1659)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:683)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:646)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    ...(lots more trace, probably irrelevant)

Then you might ask: why not just drop that library? Well, it is the one I use to list the objects in the bucket in the first place; a sketch of that code follows below.

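The listing went roughly like this. This is a minimal sketch against google-cloud-storage 1.70.0, not my exact code: "my-bucket" is a placeholder, and jsonKeyFile is the same key file as above.

import java.io.FileInputStream
import scala.collection.JavaConverters._

import com.google.auth.oauth2.ServiceAccountCredentials
import com.google.cloud.storage.StorageOptions

// Build a Storage client from the same service-account key file.
val storage = StorageOptions.newBuilder()
  .setCredentials(ServiceAccountCredentials.fromStream(new FileInputStream(jsonKeyFile)))
  .build()
  .getService

// List every object in the bucket ("my-bucket" is a placeholder).
storage.list("my-bucket").iterateAll().asScala
  .foreach(blob => println(blob.getName))

This part worked on its own; it was the .csv call above that blew up once google-cloud-storage was on the classpath.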

And I have not found an easy way to list the objects without pulling in that library.

And*_*ndy 5

I found what was causing the problem. Guava 27.1-android became a transitive dependency of some library at some point; I do not know which one or how it got there, but it was the version in use. In the android variant of Guava, com.google.common.base.Supplier does not extend java.util.function.Supplier, which is exactly what the IncompatibleClassChangeError above complains about.
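A quick way to confirm which Guava variant actually got loaded is a reflection check like this (a diagnostic sketch, assuming nothing beyond the Guava and JDK classes involved):

// In the -jre variant, Guava's Supplier extends java.util.function.Supplier;
// in the -android variant it does not.
val compatible = classOf[java.util.function.Supplier[_]]
  .isAssignableFrom(classOf[com.google.common.base.Supplier[_]])
println(s"Guava Supplier extends java.util.function.Supplier: $compatible")

// Print which jar the Supplier class was loaded from.
Option(classOf[com.google.common.base.Supplier[_]].getProtectionDomain.getCodeSource)
  .foreach(cs => println(cs.getLocation))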

I fixed it by adding Guava 27.1-jre to my dependencies. I do not know whether the order matters, but I dare not touch anything now. Here is where I put it:

libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1" % "provided"
libraryDependencies += "com.google.guava" % "guava" % "27.1-jre"
libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "1.70.0"
//BQ samples as of 27feb2019 use hadoop2 but hadoop3 seems to work fine and are recommended elsewhere
libraryDependencies += "com.google.cloud.bigdataoss" % "bigquery-connector" % "hadoop3-0.13.16" % "provided"
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-1.9.16" % "provided"
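An alternative worth knowing about (untested here): sbt can pin Guava explicitly with dependencyOverrides, which should make the fix independent of declaration order by forcing the resolver to pick 27.1-jre no matter which transitive dependency drags in the android variant.

dependencyOverrides += "com.google.guava" % "guava" % "27.1-jre"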

Hopefully this saves some other poor soul from wasting two days on this bs.