具有Hbase集成的Spark结构化流

HAR*_*NGH 3 hbase scala apache-kafka apache-spark spark-streaming

我们正在对从MySQL收集的kafka数据进行流式处理。现在,所有分析完成后,我想将数据直接保存到Hbase。我已经了解了Spark结构化的流式文档,但是使用Hbase找不到任何接收器。下面是我用来从Kafka读取数据的代码。

 val records = spark.readStream.format("kafka").option("subscribe", "kaapociot").option("kafka.bootstrap.servers", "XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load
 val jsonschema = StructType(Seq(StructField("header", StringType, true),StructField("event", StringType, true)))
 val uschema = StructType(Seq(
             StructField("MeterNumber", StringType, true),
             StructField("Utility", StringType, true),
             StructField("VendorServiceNumber", StringType, true),
             StructField("VendorName", StringType, true),
             StructField("SiteNumber",  StringType, true),
             StructField("SiteName", StringType, true),
             StructField("Location", StringType, true),
             StructField("timestamp", LongType, true),
             StructField("power", DoubleType, true)
             ))
 val DF_Hbase = records.selectExpr("cast (value as string) as Json").select(from_json($"json",schema=jsonschema).as("data")).select("data.event").select(from_json($"event", uschema).as("mykafkadata")).select("mykafkadata.*")
Run Code Online (Sandbox Code Playgroud)

现在最后,我想将DF_Hbase数据帧保存在hbase中。

Moh*_*bdo 6

1-将这些库添加到您的项目中:

      "org.apache.hbase" % "hbase-client" % "2.0.1"
      "org.apache.hbase" % "hbase-common" % "2.0.1"
Run Code Online (Sandbox Code Playgroud)

2-将此特征添加到您的代码中:

   import java.util.concurrent.ExecutorService
   import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
   import org.apache.hadoop.hbase.security.User
   import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
   import org.apache.spark.sql.ForeachWriter

   trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {

     val tableName: String
     val hbaseConfResources: Seq[String]

     def pool: Option[ExecutorService] = None

     def user: Option[User] = None

     private var hTable: Table = _
     private var connection: Connection = _


     override def open(partitionId: Long, version: Long): Boolean = {
       connection = createConnection()
       hTable = getHTable(connection)
       true
     }

     def createConnection(): Connection = {
       val hbaseConfig = HBaseConfiguration.create()
       hbaseConfResources.foreach(hbaseConfig.addResource)
       ConnectionFactory.createConnection(hbaseConfig, pool.orNull,                      user.orNull)

     }

     def getHTable(connection: Connection): Table = {
       connection.getTable(TableName.valueOf(tableName))
     }

     override def process(record: RECORD): Unit = {
       val put = toPut(record)
       hTable.put(put)
     }

     override def close(errorOrNull: Throwable): Unit = {
       hTable.close()
       connection.close()
     }

     def toPut(record: RECORD): Put

   }
Run Code Online (Sandbox Code Playgroud)

3-将其用于您的逻辑:

    val ds = .... //anyDataset[WhatEverYourDataType]

    val query = ds.writeStream
           .foreach(new HBaseForeachWriter[WhatEverYourDataType] {
                            override val tableName: String = "hbase-table-name"
                            //your cluster files, i assume here it is in resources  
                            override val hbaseConfResources: Seq[String] = Seq("core-site.xml", "hbase-site.xml") 

                            override def toPut(record: WhatEverYourDataType): Put = {
                              val key = .....
                              val columnFamaliyName : String = ....
                              val columnName : String = ....
                              val columnValue = ....

                              val p = new Put(Bytes.toBytes(key))
                              //Add columns ... 
                   p.addColumn(Bytes.toBytes(columnFamaliyName),
                               Bytes.toBytes(columnName), 
                               Bytes.toBytes(columnValue))

                              p
                            }

                          }
           ).start()

         query.awaitTermination()
Run Code Online (Sandbox Code Playgroud)


Rob*_*att 0

您正在处理来自 Kafka 的数据吗?或者只是将其泵送到 HBase?可以考虑的一个选项是使用Kafka Connect。这为您提供了一种基于配置文件的方法,用于将 Kafka 与其他系统(包括 HBase)集成。