Spark 1.6.1 S3 MultiObjectDeleteException

Ale*_*sky 3 amazon-s3 apache-spark spark-streaming

I'm writing data to S3 from Spark using S3A URIs.
I'm also using the s3-external-1.amazonaws.com endpoint to avoid read-after-write eventual-consistency issues in us-east-1.
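
For context, this is roughly how the endpoint is configured and where the write happens (a sketch; the bucket and path are placeholders, `sc` is the SparkContext and `df` the DataFrame being persisted inside `foreachRDD`):

    // point the S3A connector at the us-east-1 "external" endpoint
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")

    // the failing call is the ORC write, roughly:
    df.write.orc("s3a://my-bucket/events/")   // fails with MultiObjectDeleteException on commit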

The following problem occurs when trying to write some data to S3 (it's actually a move operation):

  com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could not be deleted, S3 Extended Request ID: null
    at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:1745)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:687)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:381)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:314)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:346)
    at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc$3.apply(persistentEventStreamBootstrap.scala:122)
    at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc$3.apply(persistentEventStreamBootstrap.scala:112)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)

Object(s):

Ste*_*ran 9

This can also be caused by a race condition where more than one process tries to delete the path; HADOOP-14101 points to that.

In that specific case, you should be able to make the problem go away by setting the Hadoop option fs.s3a.multiobjectdelete.enable to false.
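
For example (a sketch; `sc` is the SparkContext), either of these turns the bulk-delete call off so S3A falls back to issuing one DELETE per key:

    // (a) directly on the Hadoop configuration
    sc.hadoopConfiguration.set("fs.s3a.multiobjectdelete.enable", "false")

    // (b) via Spark's spark.hadoop.* pass-through, e.g. on spark-submit:
    //     --conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false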

Update, 2017-02-23

Having written some tests for this, I haven't been able to reproduce it for deleting nonexistent paths, but I have for permission problems. I'm assuming that's the cause here, but more stack traces would be welcome to help pin the problem down. HADOOP-11572 covers the issue, including patches, documentation, and better logging of the failure (i.e. logging the paths that failed and the specific errors).


spa*_*our 6

I ran into this when upgrading to Spark 2.0.0, and it turned out to be a missing S3 permission. I'm currently running Spark 2.0.0 with aws-java-sdk-1.7.4 and hadoop-aws-2.7.2 as dependencies.

To fix it, I had to add the s3:Delete* action to the appropriate IAM policy. Depending on how your environment is set up, that could be a policy on the S3 bucket, a policy on the user whose SECRET_KEY the Hadoop s3a library connects with, or an IAM role policy for the EC2 instances running Spark.

In my case, my working IAM role policy now looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Delete*", "s3:Get*", "s3:List*", "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::mybucketname/*"
        }
    ]
}

This is a quick change through the S3 or IAM AWS console and should take effect immediately, with no need to restart the Spark cluster. If you're unsure how to edit policies, I've provided more details here.
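
If you want to sanity-check the permission from the Spark shell before rerunning the job, a rough probe like the one below (hypothetical path; it only exercises a single-key delete, not the bulk deleteObjects call from the stack trace) will throw an access-denied error if delete permission is still missing:

    import org.apache.hadoop.fs.Path

    // hypothetical marker object under the same bucket the job writes to
    val probe = new Path("s3a://mybucketname/tmp/delete-permission-check")
    val fs = probe.getFileSystem(sc.hadoopConfiguration)

    fs.create(probe).close()   // needs s3:PutObject
    fs.delete(probe, false)    // fails if delete permission is still missing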