Guava dependency error with Spark + Play framework in Scala

Fel*_*ipe 1 scala playframework apache-spark

I have a Play web application using Scala 2.11.8 with Spark "spark-core" % "2.2.0" and "spark-sql" % "2.2.0". I am trying to read a file containing movie ratings and run some transformations on it. When I split each line on tabs (movieLines.map(x => (x.split("\t")(1).toInt, 1))), I get an error that I suspect is caused by a Guava library dependency conflict, since every fix I found while searching on Google is based on that. But I can't figure out how to exclude the conflicting Guava dependency.

Here is my code:

def popularMovies() = Action { implicit request: Request[AnyContent] =>
    Util.downloadSourceFile("downloads/ml-100k.zip", "http://files.grouplens.org/datasets/movielens/ml-100k.zip")
    Util.unzip("downloads/ml-100k.zip")

    val sparkContext = SparkCommons.sparkSession.sparkContext
    println("got sparkContext")

    val movieLines = sparkContext.textFile("downloads/ml-100k/u.data")
    println("popularMovies")
    println(movieLines)

    // Map to (movieID , 1) tuples
    val movieTuples = movieLines.map(x => (x.split("\t")(1).toInt, 1))
    println("movieTuples")
    println(movieTuples)

    // Count up all the 1's for each movie
    val movieCounts = movieTuples.reduceByKey((x, y) => x + y)
    println("movieCounts")
    println(movieCounts)

    // Flip (movieId, count) to (count, movieId)
    val movieCountFlipped = movieCounts.map(x => (x._2, x._1))
    println(movieCountFlipped)

    // Sort
    val sortedMovies = movieCountFlipped.sortByKey()
    println(sortedMovies)

    // collect and print the result
    val results = sortedMovies.collect().toList.mkString(",\n")
    println(results)

    Ok("[" + results + "]")
  }
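Note that the error surfaces only at the `collect()` step, because the earlier RDD operations are lazy. The aggregation logic itself is sound; it mirrors plain Scala collection operations. A minimal sketch of the same pipeline on a `List`, using hypothetical sample lines in the `u.data` format (userID, movieID, rating, timestamp, tab-separated):

```scala
object PopularMoviesSketch {
  // Hypothetical sample lines in the u.data layout: user \t movie \t rating \t timestamp
  val lines = List(
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t242\t1\t878887116",
    "244\t302\t2\t880606923",
    "166\t346\t1\t886397596"
  )

  def popular(lines: List[String]): List[(Int, Int)] =
    lines
      .map(x => (x.split("\t")(1).toInt, 1))      // (movieID, 1), like the RDD map
      .groupBy(_._1)                               // movieID -> all its (movieID, 1) tuples
      .toList
      .map { case (id, ones) => (ones.size, id) }  // flip to (count, movieID)
      .sortBy(identity)                            // ascending by count, ties by movieID

  def main(args: Array[String]): Unit =
    println(popular(lines))                        // List((1,346), (2,242), (2,302))
}
```

The `reduceByKey` / `sortByKey` pair in the Spark version distributes exactly this count-and-sort across partitions.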

And the error:

[error] application - 

! @76oh9h40m - Internal server error, for (GET) [/api/popularMovies] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[RuntimeException: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat]]
    at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:255)
    at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:180)
    at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:311)
    at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:309)
    at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
    at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
Caused by: java.lang.RuntimeException: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
    at play.api.mvc.ActionBuilder$$anon$2.apply(Action.scala:424)
    at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:96)
    at play.api.mvc.Action$$anonfun$apply$2.apply(Action.scala:89)
    at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2$$anonfun$1.apply(Accumulator.scala:174)
    at scala.util.Try$.apply(Try.scala:192)
    at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:174)
    at play.api.libs.streams.StrictAccumulator$$anonfun$mapFuture$2.apply(Accumulator.scala:170)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:312)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)

Fel*_*ipe 5

I added this dependency and it solved my problem:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.2"
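For context, a build.sbt sketch of how this fits alongside the Spark dependencies (the `dependencyOverrides` line is an additional option, not part of the accepted fix; the exact Guava version to pin is an assumption you should verify with your own dependency tree):

```scala
// build.sbt (sketch): hadoop-client 2.7.2 brings a Guava usage compatible
// with the Stopwatch constructor that FileInputFormat expects
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "2.2.0",
  "org.apache.spark" %% "spark-sql"     % "2.2.0",
  "org.apache.hadoop" % "hadoop-client" % "2.7.2"
)

// Alternative approach: force every module to resolve the same Guava version
// (hypothetical version; inspect your resolved tree before pinning)
// dependencyOverrides += "com.google.guava" % "guava" % "16.0.1"
```

The root cause is that an older Hadoop client on the classpath calls a package-private `Stopwatch` constructor that was removed in newer Guava versions, so aligning the Hadoop and Guava versions resolves the `IllegalAccessError`.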

  • Thank you very much! I ran into exactly the same problem, and I thought I would have to figure out how to shade Guava in either Play or Spark. I completely failed to work that out, because trying to deploy Play OR Spark alone in an uber-jar raises a whole series of problems, never mind both! How did you find this solution? Do you know why it works? (2 upvotes)
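For reference, the shading approach the commenter mentions is also possible with the sbt-assembly plugin. A minimal sketch, assuming sbt-assembly is already installed (the relocated package prefix `shaded.` is arbitrary, and this is untested against this exact project):

```scala
// build.sbt (sketch): relocate Guava classes inside the uber-jar so that
// Hadoop and Play each link against the Guava version they were built with
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
```

This rewrites Guava's bytecode references at assembly time, which avoids the version conflict entirely, at the cost of a more complicated build.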