I am trying to use SparkSession to convert a file's JSON data into an RDD using Spark Notebook. I already have the JSON file.
val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("config.key.here", configValueHere)
  .enableHiveSupport()
  .getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to Spark and don't know what to use for config.key.here and configValueHere.
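For context, a minimal sketch of what this usually looks like; the .config() call is optional, and the key shown ("spark.sql.shuffle.partitions") is only an example of a real Spark setting, not something this particular job requires:

import org.apache.spark.sql.SparkSession

// Minimal sketch: any real Spark property can go in .config(), or the call can be dropped entirely.
val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("spark.sql.shuffle.partitions", "8")  // example key/value only
  .enableHiveSupport()
  .getOrCreate()

val jread = spark.read.json("search-results1.json") // DataFrame
val jrdd  = jread.rdd                               // underlying RDD[Row], if an RDD is really needed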
I am having trouble using magellan-1.0.4-s_2.11 with Spark Notebook. I downloaded the JAR from https://spark-packages.org/package/harsha2010/magellan and tried placing SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.4-s_2.11 in the Start of Customized Settings section of the spark-notebook file in the bin folder.
These are my imports:
import magellan.{Point, Polygon, PolyLine}
import magellan.coord.NAD83
import org.apache.spark.sql.magellan.MagellanContext
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
And my errors...
<console>:71: error: object Point is not a member of package org.apache.spark.sql.magellan
import magellan.{Point, Polygon, PolyLine}
^
<console>:72: error: object coord is not a member of package org.apache.spark.sql.magellan
import magellan.coord.NAD83
^
<console>:73: error: object MagellanContext is not a member of package org.apache.spark.sql.magellan
import org.apache.spark.sql.magellan.MagellanContext
I then tried importing the new library by placing it in the main script, like any other library:
$lib_dir/magellan-1.0.4-s_2.11.jar"
This does not work, and I have been scratching my head wondering what I am doing wrong. How do I import libraries like magellan into spark-notebook?
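For reference, a minimal sketch of the usual check, assuming the notebook's JVM was actually started with the package on its classpath (for a plain spark-shell the equivalent is the --packages flag quoted above); if the dependency resolved, the import that failed above compiles cleanly:

// Sketch only: assumes the notebook (or spark-shell) was launched with
//   --packages harsha2010:magellan:1.0.4-s_2.11
// or with the downloaded jar on the driver and executor classpath.
// If that worked, this import no longer fails with
// "object Point is not a member of package ..." as in the errors above.
import magellan.{Point, Polygon, PolyLine}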
; WITH Hierarchy as
(
select distinct PersonnelNumber
, Email
, ManagerEmail
from dimstage
union all
select e.PersonnelNumber
, e.Email
, e.ManagerEmail
from dimstage e
join Hierarchy as h on e.Email = h.ManagerEmail
)
select * from Hierarchy
Can you help achieve the same thing in Spark SQL?
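Spark SQL does not support recursive CTEs, so the usual workaround is to unroll the recursion as an iterative self-join in code. A sketch in Scala, assuming dimstage is available as a DataFrame and spark is the active SparkSession (both are assumptions about the environment):

import org.apache.spark.sql.DataFrame
import spark.implicits._  // assumes `spark` is the active SparkSession

// Anchor rows, mirroring the non-recursive branch of the CTE.
val base: DataFrame = dimstage
  .select("PersonnelNumber", "Email", "ManagerEmail")
  .distinct()

var hierarchy = base   // accumulated result
var frontier  = base   // rows added in the previous iteration
var keepGoing = true

while (keepGoing) {
  // Mirror the recursive branch: join dimstage back onto the rows found so far.
  val next = dimstage.as("e")
    .join(frontier.as("h"), $"e.Email" === $"h.ManagerEmail")
    .select($"e.PersonnelNumber", $"e.Email", $"e.ManagerEmail")

  val newRows = next.except(hierarchy)
  keepGoing = newRows.count() > 0
  if (keepGoing) {
    hierarchy = hierarchy.union(newRows)
    frontier  = newRows
  }
}

hierarchy.show()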
I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From a notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)
scanner.setRange(new Range("row_0000000000", "row_0000000010"))
for (entry: Entry[Key, Value] <- scanner) {
  println(entry.getKey + " is " + entry.getValue)
}
This returns the first ten rows of table data.
When I try to create an RDD:
val rdd2 =
  sparkContext.newAPIHadoopRDD(
    new Configuration(),
    classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
    classOf[org.apache.accumulo.core.data.Key],
    classOf[org.apache.accumulo.core.data.Value]
  )
I get an RDD back, but I can't do much with it because of the following error:
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$ …
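The error indicates the AccumuloInputFormat was never told which instance, user, or table to read, so getSplits fails. A hedged sketch of the kind of configuration normally needed before calling newAPIHadoopRDD (class names follow the Accumulo 1.6 mapreduce API; the exact class hosting each static setter can differ between versions, and sparkContext, instanceNameS, and zooServersS are the values already defined above):

import org.apache.hadoop.mapreduce.Job
import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations

// Sketch: tell the input format where Accumulo is, who is connecting,
// which table to scan, and with what authorizations.
val job = Job.getInstance()
AbstractInputFormat.setZooKeeperInstance(job,
  ClientConfiguration.loadDefault().withInstance(instanceNameS).withZkHosts(zooServersS))
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

// Hand the populated configuration (not an empty one) to Spark.
val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])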
Traceback (most recent call last):
File "c:\users\rdx\anaconda3\lib\runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\rdx\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\RDX\Anaconda3\Scripts\ipython.exe\__main__.py", line 9, in <module>
File "c:\users\rdx\anaconda3\lib\site-packages\IPython\__init__.py", line 119, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "c:\users\rdx\anaconda3\lib\site-packages\traitlets\config\application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-112>", line 2, in initialize
File "c:\users\rdx\anaconda3\lib\site-packages\traitlets\config\application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "c:\users\rdx\anaconda3\lib\site-packages\IPython\terminal\ipapp.py", line 296, in initialize
super(TerminalIPythonApp, self).initialize(argv)
File "<decorator-gen-7>", line 2, in initialize
File "c:\users\rdx\anaconda3\lib\site-packages\traitlets\config\application.py", line 87, in catch_config_error
return …

Running a basic df.show() after installing Spark Notebook
I get the following error when running Scala Spark code on Spark-notebook. Any idea when this happens and how to avoid it?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
DataFrame shows _c0, _c1 instead of my original column names from the first row
I want to use the first row of my CSV as the column names.
dff = spark.read.csv("abfss://dir@acname.dfs.core.windows.net/diabetes.csv")
dff:pyspark.sql.dataframe.DataFrame
_c0:string
_c1:string
_c2:string
_c3:string
_c4:string
_c5:string
_c6:string
_c7:string
_c8:string
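By default the CSV reader does not treat the first row as a header, hence the generated _c0, _c1, ... names. A sketch of the usual fix, shown in Scala to match the rest of this page (the PySpark reader takes the same option names):

// Sketch: "header" makes the first row the column names;
// "inferSchema" is optional but gives typed columns instead of all strings.
val dff = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("abfss://dir@acname.dfs.core.windows.net/diabetes.csv")

dff.printSchema()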
Running spark-notebook with docker on OS X (via boot2docker) doesn't seem to do anything. Here is the output:
pkerp@toc:~/apps/spark-notebook$ docker run -p 9000:9000 andypetrella/spark-notebook:0.1.4-spark-1.2.0-hadoop-1.0.4
Play server process ID is 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/docker/lib/spark-repl_2.10-1.2.0-notebook.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/docker/lib/ch.qos.logback.logback-classic-1.1.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/docker/lib/org.slf4j.slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/02/07 11:51:32 INFO play: Application started (Prod)
15/02/07 11:51:32 INFO play: Listening for HTTP on /0:0:0:0:0:0:0:0:9000
When I point my browser at http://localhost:9000, it says the web page is unavailable. Am I missing something? Is something misconfigured?
I have a Python notebook A in Azure Databricks with an import statement like this:
import xyz, datetime, ...
I am importing another notebook, xyz, in notebook A, as shown in the code above. When I run notebook A, it throws the following error:
ImportError: No module named xyz
Both notebooks are in the same workspace directory. Can anyone help with this issue?
I am using a Jupyter Notebook on EMR to process a large chunk of data. While processing the data I see this error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
It looks like I need to update maxResultSize in the Spark configuration. How do I set Spark maxResultSize from a Jupyter notebook?
I have already checked this post: Spark 1.4 increase maxResultSize memory
Also, in EMR notebooks the Spark context is already provided; is there any way to edit the Spark context and increase maxResultSize?
Any clue would be very helpful.
Thanks.
I have a requirement where I need to pass a PySpark DataFrame as a notebook parameter to a child notebook. Essentially, the child notebook has a few functions that take a DataFrame as a parameter and perform certain tasks. The problem is that I cannot pass the DataFrame to that child notebook (without writing it out to a temp directory) using
dbutils.notebook.run(<notebookpath>, timeout, <arguments>)
I tried referring to this URL - Return a dataframe from another notebook in databricks
However, I am still a bit confused about how to return a DataFrame from a child notebook to the parent notebook, and from the parent notebook to another child notebook.
I tried writing code like this -
tempview_list = ["tempView1", "tempView2", "tempView3"]
for tempview in tempview_list:
    dbutils.notebook.exit(spark.sql(f"Select * from {tempview}"))
But it only returns the schema of the first tempView.
Please help. I am new to PySpark.
Thanks.
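dbutils.notebook.run and dbutils.notebook.exit only pass strings between notebooks, and exit ends the child on its first call, which is why the loop above comes back with only one result. A common workaround is to share the data through global temp views and pass just their names. A sketch in Scala, assuming both notebooks run on the same cluster, the views/tables named below already exist, and the child notebook path is hypothetical:

// --- child notebook (sketch) ---
val viewNames = Seq("tempView1", "tempView2", "tempView3")
viewNames.foreach { name =>
  // Register each DataFrame where the parent notebook can see it.
  spark.table(name).createOrReplaceGlobalTempView(name)
}
// Single exit call, returning all the names as one string.
dbutils.notebook.exit(viewNames.mkString(","))

// --- parent notebook (sketch) ---
val returned = dbutils.notebook.run("/path/to/childNotebook", 600, Map.empty[String, String])
val dfs = returned.split(",").map(name => spark.table(s"global_temp.$name"))
dfs.foreach(_.show())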
pyspark spark-notebook databricks azure-notebooks azure-databricks