Unable to connect to Mongo from pyspark

Kri*_*hna 2 python mongodb pyspark

I am trying to connect to MongoDB using pyspark. Below is the code I am using:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()

I am getting the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.

Wan*_*iar 5

Failed to find data source: com.mongodb.spark.sql.DefaultSource.

This error indicates that PySpark failed to locate the MongoDB Spark Connector.
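
One way to confirm this diagnosis is to look at what the running context was actually configured with. The snippet below is only a rough sketch: `connector_configured` is a hypothetical helper name, and it assumes the connector is supplied through `spark.jars.packages` (a jar added through `spark.jars` or the classpath would not be detected by it).

```python
def connector_configured(conf, artifact="mongo-spark-connector"):
    """Return True if `artifact` appears in the conf's spark.jars.packages.

    `conf` is anything with a .get(key, default) method, such as a
    pyspark SparkConf. The helper name and the substring match are
    illustrative assumptions, not part of the connector's API.
    """
    packages = conf.get("spark.jars.packages", "") or ""
    # spark.jars.packages is a comma-separated list of Maven coordinates.
    return any(artifact in entry for entry in packages.split(","))
```

With the context from the question this could be called as `connector_configured(sc.getConf())`; a result of `False` would be consistent with the `ClassNotFoundException` above.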

If you are invoking pyspark directly, make sure that you specify mongo-spark-connector in the packages parameter. For example:

./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0

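If you are not invoking pyspark directly (for example, from an IDE such as Eclipse), you will have to modify the Spark configuration spark.jars.packages to specify the dependency.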

Either in the spark-defaults.conf file:

spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0

Or you can try changing the configuration within the code:

SparkConf().set("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")

Or:

SparkSession.builder.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.11:2.2.0' ).getOrCreate()
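
Putting the pieces together with the code from the question, a minimal sketch might look like the following. The helper names (`connector_coordinate`, `mongo_spark_settings`, `load_mongo_dataframe`) are made up for illustration. It assumes a Spark build on Scala 2.11 (the `_2.11` suffix should match your Spark's Scala version) and that no SparkContext is already running, since `spark.jars.packages` is, as far as I know, only honoured when it is set before the context starts.

```python
def connector_coordinate(scala_version="2.11", connector_version="2.2.0"):
    """Build the Maven coordinate of the MongoDB Spark Connector.

    The defaults mirror the versions used in this answer; adjust them
    to match your own Spark and Scala versions.
    """
    return "org.mongodb.spark:mongo-spark-connector_{}:{}".format(
        scala_version, connector_version
    )


def mongo_spark_settings(input_uri, coordinate=None):
    """Collect the Spark settings needed to read from MongoDB."""
    return {
        # Tells Spark to fetch the connector (and its dependencies).
        "spark.jars.packages": coordinate or connector_coordinate(),
        # Same URI key the question sets on its SparkConf.
        "spark.mongodb.input.uri": input_uri,
    }


def load_mongo_dataframe(input_uri="mongodb://127.0.0.1/mydb.test"):
    """Create a session with the connector configured and load a DataFrame."""
    # Imported here so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("App")
    for key, value in mongo_spark_settings(input_uri).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
```

Calling `load_mongo_dataframe().printSchema()` would then correspond to the last two lines of the question's code, provided a MongoDB instance is reachable at that URI.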