Posts by www*_*wan

org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times

I am using Google Cloud Dataproc to run Spark jobs, with Zeppelin as my editor. I am trying to write JSON data to a GCP storage bucket. It succeeds with a 10 MB file but fails with a 10 GB file. My Dataproc cluster has 1 master with 4 CPUs, 26 GB of memory, and a 500 GB disk, plus 5 workers with the same configuration. I would expect it to be able to handle 10 GB of data.

My command is toDatabase.repartition(10).write.json("gs://mypath")

The error is:

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:528)
  ... 54 elided
Caused by: org.apache.spark.SparkException: …
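For reference, a minimal sketch of the write pattern in question. Everything except the repartition(...).write.json call is an assumption (the SparkSession setup and the source of toDatabase are hypothetical); the one deliberate change is a higher partition count, since 10 partitions of a 10 GB dataset put roughly 1 GB on each write task, and spreading the write over more, smaller tasks is a common way to relieve per-task memory pressure. It is not a verified fix for this stack trace, whose root cause is elided in the truncated Caused by line.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("write-json-to-gcs")
    .getOrCreate()

  // Hypothetical source; the question does not show how toDatabase is built.
  val toDatabase = spark.read.json("gs://some-input-path")

  // 200 is an illustrative count; the goal is simply smaller tasks
  // than repartition(10) produces on 10 GB of data.
  toDatabase
    .repartition(200)
    .write
    .mode("overwrite")
    .json("gs://mypath")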

scala google-cloud-storage apache-spark google-cloud-platform google-cloud-dataproc

7 votes · 1 answer · 20k views

In Scala, how do I count the elements that appear in only one of two arrays?

For example, I have an array a: Array[Int] = Array(1, 1, 2, 2, 3) and an array b: Array[Int] = Array(2, 3, 4, 5). I want to count the elements that appear only in a or only in b. In this case they are (1, 1, 4, 5), so the count is 4.

I tried diff, union, and intersect, but could not find a combination of them that produces the desired result.
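A sketch that matches the example above: the trick is value membership rather than multiset subtraction. a.diff(b) removes only one occurrence per matching element (here it leaves Array(1, 1, 2)), whereas filtering each array against the other's Set keeps exactly the occurrences whose value never appears on the other side.

  val a = Array(1, 1, 2, 2, 3)
  val b = Array(2, 3, 4, 5)

  // A Set[Int] is also an Int => Boolean, so it can serve directly as
  // the filter predicate; membership tests are effectively constant time.
  val onlyInOne = a.filterNot(b.toSet) ++ b.filterNot(a.toSet)

  println(onlyInOne.mkString(", ")) // 1, 1, 4, 5
  println(onlyInOne.length)         // 4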

arrays diff scala dataframe intersect

3 votes · 1 answer · 47 views