scala jdbc apache-spark apache-spark-sql
I'm running into a very strange problem when trying to load a JDBC DataFrame into Spark SQL.
I've tried several Spark clusters - YARN, a standalone cluster, and pseudo-distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs both in spark-shell and when executing code with spark-submit. I've tried the MySQL and MS SQL JDBC drivers without success.
Consider the following sample:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"
val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}
So far so good, the schema gets resolved properly:
t1: org.apache.spark.sql.DataFrame = [id: int, name: string]
But when I evaluate the DataFrame:
t1.take(1)
the following exception occurs:
15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Yet when I open a JDBC connection on the executors by hand:
import java.sql.DriverManager
sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()
it works perfectly:
res1: Array[Int] = Array(0, 1)
The same code also works perfectly when run against local Spark:
scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])
I'm using the pre-built Spark with Hadoop 2.4 support.
The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh script and run the following command:
/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar
Is there any way to work around this? It looks like a severe problem, so it's strange that googling turns up nothing here.
Apparently this issue has been reported recently:
https://issues.apache.org/jira/browse/SPARK-6913
The problem is in java.sql.DriverManager, which doesn't see drivers loaded by class loaders other than the bootstrap class loader.
As a temporary workaround, the required drivers can be added to the executors' bootstrap classpath.
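For example (a minimal sketch, assuming a standalone cluster; the jar location is illustrative), copy the driver jar to the same path on every worker node and add it to the executor classpath in conf/spark-defaults.conf:

spark.executor.extraClassPath /path/to/mysql-connector-java-5.1.35.jar

Unlike --jars, which serves the jar through Spark's task class loader, extraClassPath puts it on the executor JVM's own classpath, where DriverManager can see it.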
UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782
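Roughly, the fix registers a thin wrapper with DriverManager from a class that Spark's own class loader can see, and delegates to the real driver loaded from the class loader that serves --jars. A simplified sketch of the idea (the names below are illustrative, not Spark's actual internals):

import java.sql.{Connection, Driver, DriverManager, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

// Delegates every call to the user-supplied driver. Since this wrapper class is
// visible to the calling code's class loader, DriverManager's caller check passes.
class DriverWrapper(wrapped: Driver) extends Driver {
  def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
  def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
  def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    wrapped.getPropertyInfo(url, info)
  def getMajorVersion: Int = wrapped.getMajorVersion
  def getMinorVersion: Int = wrapped.getMinorVersion
  def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
  def getParentLogger: Logger = wrapped.getParentLogger
}

// Load the real driver from whatever class loader can see the jar,
// then register it behind the wrapper.
DriverManager.registerDriver(
  new DriverWrapper(Class.forName(driver).newInstance().asInstanceOf[Driver]))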
UPDATE 2: The fix has been merged into Spark 1.4.
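On Spark 1.4+ the original sample should then work as-is with --jars; the equivalent call through the DataFrameReader API introduced in 1.4 (a sketch using the same table and options as above) is:

val t1 = sqlContext.read.format("jdbc").options(Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "t1",
  "partitionColumn" -> "id",
  "lowerBound" -> "0",
  "upperBound" -> "100",
  "numPartitions" -> "50"
)).load()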