When I run my Spark Python code, shown below:
import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the context with this conf
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)
The file is about 20 GB and my machine has 8 GB of RAM. When I run the program in standalone mode, it throws an OutOfMemoryError:
Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)
Is Spark unable to handle files larger than my RAM? Can you tell me how to fix this?
Spark can handle files larger than memory in many cases. But here you are using take, which forces Spark to fetch all the data into a single in-memory array on the driver. In a case like this you should write the results out to files instead, e.g. with saveAsTextFile, as in the sketch below.
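A minimal sketch of that approach, assuming data is the RDD returned by sc.textFile in the question. The output path is a hypothetical example; saveAsTextFile creates it as a directory of part files and fails if the directory already exists:

words = data.flatMap(lambda x: x.split())
# Each partition is written to disk independently, so nothing larger
# than a single partition ever has to fit in memory at once.
words.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_words')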
If you are only interested in looking at some of the data, you can use sample or takeSample.
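For example (the sample size, fraction, and seed below are arbitrary choices, not required values):

words = data.flatMap(lambda x: x.split())
# takeSample brings at most the requested number of elements to the driver.
preview = words.takeSample(False, 10, 42)
print(preview)
# sample returns a new distributed RDD with roughly 1% of the records;
# it stays distributed until you explicitly collect or save it.
subset = words.sample(False, 0.01, 42)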