I get this error every time... I am using Sparkling Water (H2O)... My conf file:
***"spark.driver.memory 65g
spark.python.worker.memory 65g
spark.master local[*]"***
Run Code Online (Sandbox Code Playgroud)
The data is about 5 GB. There is no other information about this error... Does anyone know why it happens? Thanks!
***"ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while …
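One thing worth checking, as an assumption on my part rather than anything the trace states: in local mode the driver JVM is already running by the time a Python-side SparkConf is applied, so spark.driver.memory set at runtime may never take effect, and a driver that dies under memory pressure produces exactly this pair of "Answer from Java side is empty" and "Connection refused" errors. A minimal sketch of forcing the setting in before launch and then verifying it (the environment-variable approach and the app name are illustrative):

import os

# Must be set before the driver JVM starts; "pyspark-shell" is required at the end.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 65g pyspark-shell"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("sparkling-water-test")
sc = SparkContext(conf=conf)

# Verify what the driver JVM actually received (sc._conf is internal API).
print(sc._conf.get("spark.driver.memory"))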
Hi,

I have a DataFrame in a SparkContext with 400k rows and 3 columns. The driver has 143.5 GB of storage memory:
16/03/21 19:52:35 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55613 with 143.5 GB RAM, BlockManagerId(driver, localhost, 55613)
16/03/21 19:52:35 INFO BlockManagerMaster: Registered BlockManager
I want to get the contents of this DataFrame back as a Pandas DataFrame, so I do:
df_users = UserDistinct.toPandas()
But I get this error:
16/03/21 20:01:08 ERROR Executor: Exception in task 7.0 in stage 6.0 (TID 439)
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(Unknown Source)
at java.io.ByteArrayOutputStream.grow(Unknown Source)
at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
at java.io.ByteArrayOutputStream.write(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(Unknown Source)
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at …
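For what it's worth, the trace above shows the whole result being serialized into a single ByteArrayOutputStream on the JVM side. A hedged sketch of a workaround, not the poster's code, assuming the 400k-row UserDistinct from above and plain pandas: stream partitions to the driver instead of materializing everything at once.

import pandas as pd

# rdd.toLocalIterator() pulls one partition at a time through the driver,
# so the JVM never has to serialize the full dataset in one buffer.
df_users = pd.DataFrame([row.asDict() for row in UserDistinct.rdd.toLocalIterator()])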
I have just started experimenting with pyspark/Spark and ran into the problem that my code does not work. I cannot find the problem, and Spark's error output is not very helpful. I did find the same questions on Stack Overflow, but none with a clear answer or solution (at least not for me).

The code I am trying to run is:
import json
from datetime import datetime, timedelta

from pyspark.sql.session import SparkSession

from parse.data_reader import read_csv
from parse.interpolate import insert_time_range, create_time_range, linear_interpolate

spark = SparkSession.builder.getOrCreate()

df = None
with open('config/data_sources.json') as sources_file:
    sources = json.load(sources_file)
    for file in sources['files']:
        with open('config/mappings/{}.json'.format(file['mapping'])) as mapping:
            df_to_append = read_csv(
                spark=spark,
                file='{}{}'.format(sources['root_path'], file['name']),
                config=json.load(mapping)
            )
            if df is None:
                df = df_to_append
            else:
                df = df.union(df_to_append)

df.sort(["Timestamp", "Variable"]).show(n=5, truncate=False)

time_range = create_time_range(
    datetime(year=2019, month=7, day=1, hour=0),
    datetime(year=2019, month=7, day=8, hour=0),
    timedelta(seconds=3600) …

out-of-memory apache-spark pyspark databricks azure-databricks
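As an aside on the union loop above: DataFrame.union matches columns by position, not by name, so files whose mappings produce a different column order fail in confusing ways. A sketch of the same fold written explicitly, under the assumption that read_csv, sources, and spark are exactly as in the snippet; the select is the only addition, to pin the column order:

from functools import reduce

frames = []
for file in sources['files']:
    with open('config/mappings/{}.json'.format(file['mapping'])) as mapping:
        frames.append(read_csv(
            spark=spark,
            file='{}{}'.format(sources['root_path'], file['name']),
            config=json.load(mapping)
        ))

# Pin every frame to the first frame's column order before unioning.
columns = frames[0].columns
df = reduce(lambda a, b: a.union(b.select(columns)), frames)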
I am trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I keep running into different kinds of errors. However, I can successfully run the application on a dataset of about 200 entries. I have a Spark setup of 3 nodes with 1 master and 2 workers, and the 2 workers also have a Cassandra cluster installed with the data indexed with a replication factor of 2. My 2 Spark workers show 2.4 GB and 2.8 GB of memory on the web interface, and I set spark.executor.memory to 2409 when running the application, so as to get 4.7 GB of combined memory. Here is my WebUI homepage
and here is the environment page of one of the tasks.
At this stage I am simply trying to process the data stored in Cassandra with Spark. Here is the basic code I use to do this in Java:
SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        .setJars(jars);

SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SparkContextJavaFunctions context = javaFunctions(sc);

CassandraJavaRDD<CassandraRow> rdd = context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
System.out.println(rdd.count());
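A side note rather than a diagnosis: Spark memory strings such as spark.executor.memory normally carry an explicit unit suffix, so it is worth confirming that the bare 2409 mentioned above really means megabytes. A PySpark rendering of the same configuration, purely illustrative (the master URL and host names are placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf(True)
        .setMaster("spark://master-host:7077")   # placeholder master URL
        .setAppName("cassandra-count")
        .set("spark.executor.memory", "2409m")   # explicit unit suffix
        .set("spark.cassandra.connection.host", "cassandra-host"))
sc = SparkContext(conf=conf)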
For a successful run on a small dataset (200 entries), the event interface looks like this:
But when I run the same thing on the large dataset (i.e. I only change CASSANDRA_COLUMN_FAMILY), the job never terminates in the terminal; the log looks like this …
java cassandra apache-spark apache-spark-sql spark-cassandra-connector
When I run my Spark Python code, shown below:
import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the context
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)
The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:
Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at …
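The trace is the driver buffering the result of take(): flatMap(...).take(all) asks Spark to hand every token of a 20 GB file to a single Python process on an 8 GB machine, so heap exhaustion is expected regardless of spark.executor.memory. A hedged sketch of driver-friendly alternatives, reusing the same input path (the output directory is illustrative):

import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)

words = sc.textFile('/Users/tsangbosco/Downloads/transactions').flatMap(lambda x: x.split())

print(words.count())       # aggregate on the workers; only a number reaches the driver
preview = words.take(100)  # bounded sample on the driver
words.saveAsTextFile('/tmp/transactions_tokens')  # large output stays distributed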