Provided Maven Coordinates must be in the form 'groupId:artifactId:version' - PySpark and Kafka

Jim*_*lay 2 apache-kafka apache-spark pyspark

When converting Kafka messages to a DataFrame, I get an error while passing the packages as an argument.

from pyspark.sql import SparkSession, Row
from pyspark.context import SparkContext
from kafka import KafkaConsumer
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0.jar: org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:jar:2.1.1 pyspark-shell'

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "Jim_Topic") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Error:

    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: org.apache.spark#spark-sql-kafka-0-10_2.11;2.2.0.jar: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

Gio*_*ous 7

As the exception indicates, one of your dependencies has a typo:


org.apache.spark:spark-sql-kafka-0-10_2.12:jar:

It is missing the version (and it also has an unnecessary ':'). The following should do the trick:

org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
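For reference, spark-submit splits each --packages entry on ':' and expects exactly three parts; for the coordinate above those are org.apache.spark (groupId), spark-sql-kafka-0-10_2.11 (artifactId) and 2.2.0 (version). Any extra token such as 'jar', a stray '.jar' suffix, or a trailing ':' is what triggers the "must be in the form 'groupId:artifactId:version'" error from the title.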

The complete dependency list then becomes:

'--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.1.1'
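Putting it together, here is a minimal sketch of the corrected script, assuming the same Spark 2.2/Kafka versions, broker (localhost:9092) and topic (Jim_Topic) as in the question. Note that PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created, and 'pyspark-shell' stays as the final token:

import os

# Corrected --packages: comma-separated groupId:artifactId:version coordinates.
# Must be set before the SparkContext (and hence the JVM) is started;
# 'pyspark-shell' has to remain the last token of the value.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages '
    'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,'
    'org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.1.1 '
    'pyspark-shell'
)

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Batch-read the topic into a DataFrame and cast the binary key/value columns to strings.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "Jim_Topic")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))
df.show()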