I am using Google Cloud Dataproc to run Spark jobs, and my editor is Zeppelin. I am trying to write JSON data to a GCS bucket. It worked with a 10MB file, but it failed with a 10GB file. My Dataproc cluster has 1 master with 4 CPUs, 26GB of memory, and a 500GB disk, and 5 workers with the same configuration. I would expect it to be able to handle 10GB of data.
My command is toDatabase.repartition(10).write.json("gs://mypath")
The error is
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:528)
... 54 elided
Caused by: org.apache.spark.SparkException: …
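One variant I am considering (an assumption on my part, not a confirmed fix): with repartition(10), each task writes roughly 1GB of JSON, which may be heavy for a single executor; a higher partition count would make each output file smaller. A sketch, reusing the toDatabase DataFrame from above (the count of 200 is only an illustration):

// Hypothetical variant: more partitions, so each task writes a smaller file.
// 10GB across ~200 partitions is ~50MB per output file, instead of ~1GB
// per file with repartition(10).
toDatabase.repartition(200).write.json("gs://mypath")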
For example, I have an array a,
Array[Int] = Array(1, 1, 2, 2, 3)
and an array b,
Array[Int] = Array(2, 3, 4, 5)
I want to count the elements that appear only in a or only in b. In this case those are (1, 1, 4, 5), so the count is 4.
I have tried diff, union, and intersect, but I cannot find a combination of them that gives the desired result, as sketched below.
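For reference, one way I would express this in plain Scala (a minimal sketch using the arrays a and b from the example above) is to keep the elements of each array whose value does not occur at all in the other, then sum the lengths:

// Count elements whose value appears only in a or only in b,
// keeping duplicates: (1, 1) from a and (4, 5) from b give 4.
val a = Array(1, 1, 2, 2, 3)
val b = Array(2, 3, 4, 5)

val aSet = a.toSet
val bSet = b.toSet

val onlyInA = a.filterNot(bSet)   // Array(1, 1); Set[Int] works as Int => Boolean
val onlyInB = b.filterNot(aSet)   // Array(4, 5)

val count = onlyInA.length + onlyInB.length  // 4

Note this differs from a multiset symmetric difference such as (a diff b) ++ (b diff a), which would keep the leftover 2 from a and give 5 rather than 4.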