Apr*_*ril 14 hive apache-spark apache-spark-sql spark-thriftserver
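I am running Spark locally and want to access Hive tables that live in a remote Hadoop cluster.
I can already reach the Hive tables through beeline under SPARK_HOME: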
[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>
How can I access the remote Hive tables programmatically from Spark?
Ram*_*ram 19
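Spark connects directly to the Hive metastore, not through HiveServer2. To configure this:
Put hive-site.xml on your classpath and set hive.metastore.uris to the address where your Hive metastore is hosted. See also: How to connect to a Hive metastore programmatically in SparkSQL?
Import org.apache.spark.sql.hive.HiveContext, since it can run SQL queries over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Run sqlContext.sql("show tables") to verify that it works.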
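On Spark 2.0, the version shown in the question, the steps above can also be written with SparkSession, which supersedes HiveContext. The following is a minimal sketch rather than a definitive setup. The metastore address thrift://remote_hive:9083 is an assumption (9083 is the usual default metastore port, but check the hive.metastore.uris value on the cluster), the object and application names are made up for illustration, and setting the property in code stands in for placing hive-site.xml on the classpath. It also assumes a Spark build that includes Hive support.

```scala
import org.apache.spark.sql.SparkSession

object RemoteHiveMetastoreExample {
  def main(args: Array[String]): Unit = {
    // Assumption: the remote metastore Thrift service listens here.
    val metastoreUri = "thrift://remote_hive:9083"

    val spark = SparkSession.builder()
      .appName("remote-hive-metastore")   // hypothetical application name
      .master("local[*]")                 // Spark itself runs locally
      .config("hive.metastore.uris", metastoreUri)
      .enableHiveSupport()                // requires Hive classes on the classpath
      .getOrCreate()

    // Same check as sqlContext.sql("show tables") in the steps above.
    spark.sql("show tables").show()

    spark.stop()
  }
}
```

The metastore supplies only table metadata. The table data is still read from the cluster's storage, so the local machine most likely also needs network access to the remote HDFS.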
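For the JDBC route, have a look at "connecting Apache Spark with Apache Hive remotely".
Note that beeline also connects through JDBC. That is evident from your own log:
[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article.
At present the HiveServer2 driver does not allow us to use the "Sparkling" methods 1 and 2 described there, so we can rely only on method 3.
Below is an example code snippet showing how that can be done:
Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.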
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

// One row of the remote stats table.
case class StatsRec(
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password must be defined beforehand; url is the HiveServer2
// address, e.g. the one used with beeline: jdbc:hive2://remote_hive:10000
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
  .executeQuery("SELECT * FROM stats_201512301914")

// Fetch every row on the client (driver) side.
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(res.getString("first_name"),
    res.getString("last_name"),
    Timestamp.valueOf(res.getString("action_dtm")),
    res.getLong("size"),
    res.getLong("size_p"),
    res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()

// Turn the locally fetched rows into an RDD.
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.take(10).foreach(println)
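One thing the snippet above leaves open is cleanup: conn.close() is reached only if nothing throws while the rows are being read. A possible hardening, sketched here under assumptions, is a small loan-pattern helper that always closes JDBC resources. The names JdbcResources, withResource and fetchColumn are hypothetical and come from neither the article nor any library. Only standard java.sql calls are used.

```scala
import java.sql.DriverManager

object JdbcResources {
  /** Runs `body` with `resource` and always closes it, even if `body` throws. */
  def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
    try body(resource) finally resource.close()

  /** Hypothetical helper: reads one string column from every row of a query. */
  def fetchColumn(url: String, user: String, password: String,
                  query: String, column: String): List[String] =
    withResource(DriverManager.getConnection(url, user, password)) { conn =>
      withResource(conn.createStatement()) { stmt =>
        withResource(stmt.executeQuery(query)) { rs =>
          // Collect the requested column on the client side.
          val values = List.newBuilder[String]
          while (rs.next()) values += rs.getString(column)
          values.result()
        }
      }
    }
}
```

With such a helper, a call like JdbcResources.fetchColumn(url, user, password, "SELECT * FROM stats_201512301914", "first_name") would release the result set, statement and connection in reverse order of creation, whether or not the read succeeds.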