Kar*_*ult 1 sbt-assembly apache-spark parquet
I have the following code:
val testRDD: RDD[(String, Vector)] = sc.parallelize(testArray)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = testRDD.toDF()
df.write.parquet(path)
with the following build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.1"
// META-INF discarding
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case "reference.conf" => MergeStrategy.concat
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
}
When I build it with sbt-assembly (I have addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")) and then run it, I get an error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: parquet. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:219)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at InductionService.Application$.main(ParquetTest.scala:65)
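However, if I build it with IntelliJ IDEA (a normal build, not a fat JAR like the one sbt-assembly produces) and debug it inside the IDE, it works. So something is clearly wrong with the way I build it with sbt-assembly, but I don't know how to fix it. Any ideas?
I suspect the META-INF discarding code in build.sbt might be the cause, but I need that code; without it I can't build with sbt-assembly (it complains about duplicates...).
I had the same problem. The services folder in META-INF had some merge issues. I was able to fix it by adding a rule to the MergeStrategy: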
case n if n.contains("services") => MergeStrategy.concat
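As background on why this rule helps: Spark 1.6 resolves short data source names such as parquet through java.util.ServiceLoader, which reads registration files under META-INF/services (for Spark SQL, META-INF/services/org.apache.spark.sql.sources.DataSourceRegister). A blanket discard of META-INF drops those files, so the lookup fails in the fat JAR. The fragment below is only a sketch of a narrower alternative to the contains("services") match, assuming the sbt-assembly 0.12.x key names; it has not been tested against this exact project.

```scala
// build.sbt fragment: a sketch, assuming sbt-assembly 0.12.x
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  // Keep ServiceLoader registrations (merged, duplicate lines dropped), e.g.
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  // This case must come before the general META-INF discard below.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  // Everything else under META-INF (manifests, signatures) is still discarded.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
```

Matching on the path prefix rather than on any name containing "services" avoids accidentally concatenating unrelated files, such as class files in a package named services.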
This is what I have, and it works now:
val meta = """META.INF(.)*""".r
assemblyMergeStrategy in assembly := {
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
case n if n.contains("services") => MergeStrategy.concat
case n if n.startsWith("reference.conf") => MergeStrategy.concat
case n if n.endsWith(".conf") => MergeStrategy.concat
case meta(_) => MergeStrategy.discard
case x => MergeStrategy.first
}
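To confirm that an assembled JAR still contains the service registration files, a small check along these lines could be run against it. This is a minimal sketch using only the JDK's java.util.jar API; the object name ServiceEntries is hypothetical, and the JAR path is whatever sbt-assembly produced for the project.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark or sbt-assembly.
object ServiceEntries {
  /** Returns the names of all files under META-INF/services/ in the given JAR. */
  def list(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      jar.entries().asScala
        .map(_.getName)
        // Skip the directory entry itself; keep only the registration files.
        .filter(n => n.startsWith("META-INF/services/") && !n.endsWith("/"))
        .toList // force evaluation before the JAR is closed
    } finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    // args(0) is the path to the assembled fat JAR.
    list(args(0)).foreach(println)
  }
}
```

If the output does not include META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, the merge strategy is still discarding the file that Spark needs in order to find the parquet data source.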