Zac*_*ham 2 scala apache-kafka apache-spark spark-structured-streaming
我正在尝试使用CSV设置Kafka流,以便可以将其流式传输到Spark。但是,我不断
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
Run Code Online (Sandbox Code Playgroud)
我的代码看起来像这样
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
import spark.implicits._
val mySchema = StructType(Array(
StructField("incident_id", IntegerType),
StructField("date", StringType),
StructField("state", StringType),
StructField("city_or_county", StringType),
StructField("n_killed", IntegerType),
StructField("n_injured", IntegerType)
))
val streamingDataFrame = spark.readStream.schema(mySchema).csv("C:/Users/zoldham/IdeaProjects/flinkpoc/Data/test")
streamingDataFrame.selectExpr("CAST(incident_id AS STRING) AS key",
"to_json(struct(*)) AS value").writeStream
.format("kafka")
.option("topic", "testTopic")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/zoldham/IdeaProjects/flinkpoc/Data")
.start()
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testTopic").load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json(col("value"), mySchema).as("data"), col("timestamp"))
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
}
}
Run Code Online (Sandbox Code Playgroud)
我的build.sbt文件看起来像这样
name := "Spark POC"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.1.jre8"
libraryDependencies += "org.scalafx" %% "scalafx" % "8.0.144-R12"
libraryDependencies += "org.apache.ignite" % "ignite-core" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-spring" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-indexing" % "2.5.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % "2.3.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.1"
Run Code Online (Sandbox Code Playgroud)
是什么导致该错误?如您所见,我清楚地将Kafka包含在库依赖项中,甚至遵循了官方指南。这是堆栈跟踪:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
Run Code Online (Sandbox Code Playgroud)
您需要添加缺少的依赖项
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
7944 次 |
| 最近记录: |