Related troubleshooting solutions (0)

apache spark "Py4JError: Answer from Java side is empty"

I get this error every time... I'm using Sparkling Water... My conf file:

***"spark.driver.memory 65g
spark.python.worker.memory 65g
spark.master local[*]"***

The data volume is about 5 GB. There is no other information about this error... Does anyone know why it happens? Thanks!

***"ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
  File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while …
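
The "Answer from Java side is empty" followed by "Connection refused" usually just means the driver JVM died (typically from heap/GC pressure) rather than anything Py4J-specific. One quick sanity check, sketched below against standard PySpark APIs and not from the original post, is to confirm the conf file was actually picked up, since spark.driver.memory only has an effect if it reaches the JVM before launch:

# Minimal sketch: verify the driver JVM received the settings. When the
# job is launched via spark-submit/pyspark, spark-defaults.conf shows up here.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf())
for key, value in sc.getConf().getAll():
    if key.startswith(("spark.driver", "spark.python", "spark.master")):
        print(key, "=", value)  # expect 65g / 65g / local[*]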

apache-spark

Score: 6
Answers: 2
Views: 5915

java.lang.OutOfMemoryError in pyspark

Hi,

I have a DataFrame in a SparkContext with 400k rows and 3 columns. The driver has 143.5 GB of storage memory:

16/03/21 19:52:35 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55613 with 143.5 GB RAM, BlockManagerId(driver, localhost, 55613)
16/03/21 19:52:35 INFO BlockManagerMaster: Registered BlockManager

I want to retrieve the contents of this DataFrame as a Pandas DataFrame.

So I did:

df_users = UserDistinct.toPandas()

But I got this error:

16/03/21 20:01:08 ERROR Executor: Exception in task 7.0 in stage 6.0 (TID 439)
java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(Unknown Source)
    at java.io.ByteArrayOutputStream.grow(Unknown Source)
    at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
    at java.io.ByteArrayOutputStream.write(Unknown Source)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Unknown Source)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.writeObject(Unknown Source)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at …
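
toPandas() funnels every partition through the driver in one pass, which is what the growing ByteArrayOutputStream in the trace reflects. A hedged sketch of two common workarounds, not from the original post (UserDistinct is the question's DataFrame; the limit value is illustrative):

# Sketch only: avoid serializing the whole DataFrame to the driver at once.
import pandas as pd

# 1) Pull a bounded sample instead of the full table.
df_users_sample = UserDistinct.limit(50000).toPandas()

# 2) Stream rows partition by partition; Python still holds the final
#    pandas frame, but the JVM never buffers everything in one byte array.
#    (.rdd.toLocalIterator() also works on the Spark 1.6 in the logs.)
df_users = pd.DataFrame([row.asDict() for row in UserDistinct.rdd.toLocalIterator()])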

pandas apache-spark pyspark

Score: 5
Answers: 2
Views: 5488

PySpark DataFrame operations cause OutOfMemoryError

I've just started experimenting with pyspark/spark and have hit a problem where my code doesn't work. I can't find the issue, and Spark's error output is not very helpful. I did find the same questions on Stack Overflow, but none with a clear answer or solution (at least not for me).

The code I'm trying to run is:

import json
from datetime import datetime, timedelta

from pyspark.sql.session import SparkSession

from parse.data_reader import read_csv
from parse.interpolate import insert_time_range, create_time_range, linear_interpolate

spark = SparkSession.builder.getOrCreate()

df = None
with open('config/data_sources.json') as sources_file:
    sources = json.load(sources_file)
    for file in sources['files']:
        with open('config/mappings/{}.json'.format(file['mapping'])) as mapping:
            df_to_append = read_csv(
                spark=spark,
                file='{}{}'.format(sources['root_path'], file['name']),
                config=json.load(mapping)
            )

        if df is None:
            df = df_to_append
        else:
            df = df.union(df_to_append)

df.sort(["Timestamp", "Variable"]).show(n=5, truncate=False)

time_range = create_time_range(
    datetime(year=2019, month=7, day=1, hour=0),
    datetime(year=2019, month=7, day=8, hour=0),
    timedelta(seconds=3600) …
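
Nothing in the loop is syntactically wrong, but each union extends a single ever-growing logical plan, and with enough files the driver can run out of memory while planning rather than while reading data. A sketch of the same loop with the lineage truncated periodically, reusing spark, sources, and read_csv from the snippet above (the interval of 10 is illustrative, and localCheckpoint needs Spark 2.3+):

# Sketch only: cut the lineage every few unions so the plan stays small.
import json

df = None
for i, file in enumerate(sources['files']):
    with open('config/mappings/{}.json'.format(file['mapping'])) as mapping:
        df_to_append = read_csv(
            spark=spark,
            file='{}{}'.format(sources['root_path'], file['name']),
            config=json.load(mapping)
        )
    df = df_to_append if df is None else df.union(df_to_append)
    if (i + 1) % 10 == 0:
        # On runtimes older than Spark 2.3, use sc.setCheckpointDir(...)
        # plus df.checkpoint() instead of localCheckpoint().
        df = df.localCheckpoint()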

out-of-memory apache-spark pyspark databricks azure-databricks

Score: 5
Answers: 1
Views: 207

Apache Spark cannot process a large Cassandra column family

I'm trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I keep running into different kinds of errors. However, I can run the application successfully on a dataset of about 200 entries. I have a Spark setup of 3 nodes, with 1 master and 2 workers; the 2 workers also host a Cassandra cluster in which the data is indexed with a replication factor of 2. My 2 Spark workers show 2.4 GB and 2.8 GB of memory in the web interface, and I set spark.executor.memory to 2409 when running the application, giving 4.7 GB of combined memory. Here is my WebUI home page:

[image: Spark WebUI home page]

The environment page for one of the tasks:

[image: environment page]

At this stage, I'm simply trying to process the data stored in Cassandra with Spark. Here is the basic code I use to do this in Java:

SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", CASSANDRA_HOST)
        .setJars(jars);

SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SparkContextJavaFunctions context = javaFunctions(sc);

CassandraJavaRDD<CassandraRow> rdd = context.cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);

System.out.println(rdd.count());

For a successful run on a small dataset (200 entries), the event interface looks like this:

[image: event timeline for the 200-entry run]

But when I run the same thing on the large dataset (i.e. the only thing I change is CASSANDRA_COLUMN_FAMILY), the job never terminates in the terminal, and the logs look like this …
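
Since count() only scans the table, the usual first lever with the DataStax connector is shrinking the input splits so each task fits in the ~2.4 GB workers, rather than growing spark.executor.memory further. A hedged sketch of the idea in PySpark terms (the split-size property names come from the connector docs and differ across connector versions, so treat both the names and the values as assumptions to verify):

# Sketch: smaller Cassandra input splits mean more, smaller Spark tasks.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.cassandra.connection.host", "127.0.0.1")  # your CASSANDRA_HOST
        # Newer connector lines size splits in MB; early 1.x used
        # spark.cassandra.input.split.size (an approximate row count).
        .set("spark.cassandra.input.split.size_in_mb", "16")
        .set("spark.executor.memory", "2409m"))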

java cassandra apache-spark apache-spark-sql spark-cassandra-connector

Score: 4
Answers: 1
Views: 2351

Spark throws OutOfMemoryError

When I run my Spark Python code as shown below:

import pyspark
conf = (pyspark.SparkConf()
     .setMaster("local")
     .setAppName("My app")
     .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf = conf)        #start the conf
data =sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x:x.split()).take(all)

The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:

Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at …
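
Two things compound here: all in take(all) is Python's builtin function, not a row count, and even a correct take over the whole file would pull 20 GB into an 8 GB driver. A sketch that keeps the data distributed (the input path is the poster's; the sample size and output path are illustrative):

# Sketch: inspect a bounded sample and let Spark write the full result,
# instead of collecting the whole 20 GB file into the driver.
import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)

words = sc.textFile('/Users/tsangbosco/Downloads/transactions') \
          .flatMap(lambda x: x.split())

print(words.take(10))                            # bounded peek at the data
words.saveAsTextFile('/tmp/transactions_words')  # full result stays on disk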

python apache-spark

Score: 3
Answers: 1
Views: 1685