我试图从 PySpark 连接到 MongoDB Atlas,但遇到以下问题:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext
spark = SparkSession.builder \
.config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
.config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS#@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Run Code Online (Sandbox Code Playgroud)
返回此代码的错误是这样的:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-3-346df2de8d22> in <module>()
----> 1 df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 @since(1.4)
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1255 answer = …Run Code Online (Sandbox Code Playgroud) 从Spark 下载页面,如果我下载v2.0.1的tar 文件,我会看到它包含一些我认为包含在我的应用程序中有用的 jar。
如果我下载v1.6.2的tar 文件,我在那里找不到 jars 文件夹。是否有我应该从该站点使用的替代包装类型?我目前正在选择默认值(为 Hadoop 2.6 预构建)。或者,我在哪里可以找到那些 Spark 罐子 - 我应该从http://spark-packages.org单独获取每个罐子吗?
这是我想使用的一组指示性罐子:
我正在尝试通过Apache Spark master从Mongo DB读取数据.
我正在使用3台机器:
应用程序(M3)正在连接到spark master,如下所示:
_sparkSession = SparkSession.builder.master(masterPath).appName(appName)\
.config("spark.mongodb.input.uri", "mongodb://10.0.3.150/db1.data.coll")\
.config("spark.mongodb.output.uri", "mongodb://10.0.3.150/db1.data.coll").getOrCreate()
Run Code Online (Sandbox Code Playgroud)
应用程序(M3)正在尝试从DB读取数据:
sqlContext = SQLContext(_sparkSession.sparkContext)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://user:pass@10.0.3.150/db1.data?readPreference=primaryPreferred").load()
Run Code Online (Sandbox Code Playgroud)
但是因为这个例外而失败:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:594)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at …Run Code Online (Sandbox Code Playgroud)