Create a zip file on S3 from files on S3 using Java

Question

Pet*_*ete 5 java amazon-s3 amazon-web-services java-stream aws-sdk

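I have a lot of files on S3 that I need to zip and then serve the resulting zip via S3. Currently I zip them from the stream into a local file and then upload that file again. This takes up a lot of disk space: each file is around 3-10 MB and I have to zip up to 100,000 files, so a single zip can reach roughly 1 TB or more. So I would like a solution along the lines of this: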

Create a zip file on S3 from files on S3 using Lambda Node

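There it appears that the zip is created directly on S3 without taking up local disk space. But I am just not smart enough to transfer that solution to Java. I have also found conflicting information about the Java AWS SDK, saying that they planned to change the stream behavior in 2017.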

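Not sure if this helps, but here is what I have so far (Upload is my local model holding the S3 information). I have removed logging and other details for readability. I think I am not using disk space for the download, since I "pipe" the InputStream directly into the zip. But as I said, I would also like to avoid the local zip file and create it directly on S3. That, however, would probably require creating the ZipOutputStream with S3 as the target instead of a FileOutputStream. I am not sure how that can be done.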

public File zipUploadsToNewTemp(List<Upload> uploads) {
    List<String> names = new ArrayList<>();

    byte[] buffer = new byte[1024];
    File tempZipFile;
    try {
      tempZipFile = File.createTempFile(UUID.randomUUID().toString(), ".zip");
    } catch (Exception e) {
      throw new ApiException(e, BaseErrorCode.FILE_ERROR, "Could not create Zip file");
    }
    try (
        FileOutputStream fileOutputStream = new FileOutputStream(tempZipFile);
        ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream)) {

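      for (Upload upload : uploads) {
        // try-with-resources closes the S3 stream even if copying fails
        try (InputStream inputStream = getStreamFromS3(upload)) {
          ZipEntry zipEntry = new ZipEntry(upload.getFileName());
          zipOutputStream.putNextEntry(zipEntry);
          writeStreamToZip(buffer, zipOutputStream, inputStream);
          // close each entry inside the loop, not once after it
          zipOutputStream.closeEntry();
        }
      }
      // the outer try-with-resources closes zipOutputStream; no explicit close() needed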
      return tempZipFile;
    } catch (IOException e) {
      logError(type, e);
      if (tempZipFile.exists()) {
        FileUtils.delete(tempZipFile);
      }
      throw new ApiException(e, BaseErrorCode.IO_ERROR,
          "Error zipping files: " + e.getMessage());
    }
}

  // I am not even sure, but I think this takes up memory and not disk space
private InputStream getStreamFromS3(Upload upload) {
    try {
      String filename = upload.getId() + "." + upload.getFileType();
      InputStream inputStream = s3FileService
          .getObject(upload.getBucketName(), filename, upload.getPath());
      return inputStream;
    } catch (ApiException e) {
      throw e;
    } catch (Exception e) {
      logError(type, e);
      throw new ApiException(e, BaseErrorCode.UNKOWN_ERROR,
          "Unknown error communicating with S3 for file: " + upload.getFileName());
    }
}


private void writeStreamToZip(byte[] buffer, ZipOutputStream zipOutputStream,
      InputStream inputStream) {
    try {
      int len;
      // read() returns -1 at end of stream
      while ((len = inputStream.read(buffer)) != -1) {
        zipOutputStream.write(buffer, 0, len);
      }
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, "Could not write stream to zip");
    }
}

Finally, the upload source code. The input stream is created from the temp zip file.

public PutObjectResult upload(InputStream inputStream, String bucketName, String filename, String folder) {
    String uploadKey = StringUtils.isEmpty(folder) ? "" : (folder + "/");
    uploadKey += filename;

    ObjectMetadata metaData = new ObjectMetadata();

    byte[] bytes;
    try {
      bytes = IOUtils.toByteArray(inputStream);
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, e.getMessage());
    }
    metaData.setContentLength(bytes.length);
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);

    PutObjectRequest putObjectRequest = new PutObjectRequest(bucketPrefix + bucketName, uploadKey, byteArrayInputStream, metaData);
    putObjectRequest.setCannedAcl(CannedAccessControlList.PublicRead);

    try {
      return getS3Client().putObject(putObjectRequest);
    } catch (SdkClientException se) {
      throw s3Exception(se);
    } finally {
      IOUtils.closeQuietly(inputStream);
    }
  }

I just found a similar question to what I need, also without an answer:

Upload ZipOutputStream to S3 without saving zip file (large) temporary to disk using AWS S3 Java
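
For reference, the idea asked about here (a ZipOutputStream whose target is not a FileOutputStream) can be sketched with the standard java.io piped streams: one thread writes the zip into a pipe while another reads the zip bytes in fixed-size parts. This is only a minimal sketch under stated assumptions, not the linked Node solution and not an AWS SDK API. The names StreamingZipSketch, PartSink and zipToParts are hypothetical; in a real version PartSink would presumably be backed by the S3 multipart upload operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload), whereas here it only collects bytes in memory so the example runs without AWS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class StreamingZipSketch {

    /** Hypothetical stand-in for "upload one part"; the buffer is reused, so consume it before returning. */
    interface PartSink {
        void acceptPart(int partNumber, byte[] data, int length);
    }

    /** Zips the sources into a pipe and hands the zip bytes to the sink in parts of at most partSize bytes. */
    static int zipToParts(Map<String, Supplier<InputStream>> sources, int partSize, PartSink sink) {
        try (PipedInputStream pipeIn = new PipedInputStream(64 * 1024)) {
            PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);

            // Producer thread: writes zip entries into the pipe; closing the zip signals end of stream.
            CompletableFuture<Void> producer = CompletableFuture.runAsync(() -> {
                try (ZipOutputStream zip = new ZipOutputStream(pipeOut)) {
                    byte[] buffer = new byte[8192];
                    for (Map.Entry<String, Supplier<InputStream>> source : sources.entrySet()) {
                        zip.putNextEntry(new ZipEntry(source.getKey()));
                        try (InputStream in = source.getValue().get()) {
                            int len;
                            while ((len = in.read(buffer)) != -1) {
                                zip.write(buffer, 0, len);
                            }
                        }
                        zip.closeEntry();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Consumer (calling thread): fills one part buffer at a time and passes it on.
            byte[] part = new byte[partSize];
            int partNumber = 0;
            int filled;
            while ((filled = pipeIn.readNBytes(part, 0, partSize)) > 0) {
                sink.acceptPart(++partNumber, part, filled);
            }
            producer.join(); // rethrows any failure from the producer thread
            return partNumber;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the entry names of a zip held in memory (used only to check the result). */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, Supplier<InputStream>> sources = new LinkedHashMap<>();
        sources.put("a.txt", () -> new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)));
        sources.put("b.txt", () -> new ByteArrayInputStream(new byte[100_000]));
        ByteArrayOutputStream uploaded = new ByteArrayOutputStream(); // stands in for the S3 object
        int parts = zipToParts(sources, 128, (n, data, len) -> uploaded.write(data, 0, len));
        System.out.println(parts + " parts, entries: " + entryNames(uploaded.toByteArray()));
    }
}
```

With this shape, memory use is bounded by the pipe buffer plus one part buffer, regardless of the archive size. If the parts were sent through S3 multipart upload, the part size would need care: S3 allows at most 10,000 parts per upload and requires every part except the last to be at least 5 MiB, so an archive of about 1 TB needs parts of roughly 100 MB or more.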

Answer 1

Joh*_*ein 0

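I would recommend using an Amazon EC2 instance (as low as 1c/hour, or you could even use a Spot Instance to get it at a lower price). Smaller instance types cost less but have limited bandwidth, so adjust the size until you get the performance you want.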

Then write a script that loops through the files:

Download
Zip
Upload
Delete local files

All the zip magic happens on local disk. There is no need to use streams. Just use the Amazon S3 download_file() and upload_file() calls.
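
Since the question is about Java, the loop described above might look roughly like the following. This is a hedged sketch, not the answerer's script: ObjectStore, InMemoryStore and zipViaLocalDisk are hypothetical names, the in-memory store only stands in for S3 so the example runs without AWS, and it assumes the intent is one archive for all files, with each downloaded file deleted as soon as it has been added.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class LocalDiskZipSketch {

    /** Hypothetical storage abstraction; a real version would wrap the S3 download and upload calls. */
    interface ObjectStore {
        void downloadTo(String key, Path target) throws IOException;
        void uploadFrom(Path source, String key) throws IOException;
    }

    /** In-memory stand-in for S3 so the sketch runs without AWS credentials. */
    static final class InMemoryStore implements ObjectStore {
        final Map<String, byte[]> objects = new HashMap<>();

        public void downloadTo(String key, Path target) throws IOException {
            Files.write(target, objects.get(key));
        }

        public void uploadFrom(Path source, String key) throws IOException {
            objects.put(key, Files.readAllBytes(source));
        }
    }

    /** Downloads each key, adds it to one local zip, uploads the zip, and removes all local files. */
    static void zipViaLocalDisk(ObjectStore store, List<String> keys, String zipKey) {
        Path zipFile = null;
        try {
            zipFile = Files.createTempFile("archive-", ".zip");
            try (OutputStream out = Files.newOutputStream(zipFile);
                 ZipOutputStream zip = new ZipOutputStream(out)) {
                for (String key : keys) {
                    Path tmp = Files.createTempFile("object-", ".tmp");
                    try {
                        store.downloadTo(key, tmp);          // download
                        zip.putNextEntry(new ZipEntry(key)); // zip
                        Files.copy(tmp, zip);
                        zip.closeEntry();
                    } finally {
                        tmp.toFile().delete();               // delete the local copy
                    }
                }
            }
            store.uploadFrom(zipFile, zipKey);               // upload the finished archive
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (zipFile != null) {
                zipFile.toFile().delete();                   // delete the local archive
            }
        }
    }

    /** Lists the entry names of a zip held in memory (used only to check the result). */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.objects.put("a.txt", "hello".getBytes());
        store.objects.put("b.txt", new byte[4096]);
        zipViaLocalDisk(store, List.of("a.txt", "b.txt"), "bundle.zip");
        System.out.println(entryNames(store.objects.get("bundle.zip")));
    }
}
```

Note that in this variant only one source file is on disk at a time, but the archive itself still has to fit on the local volume until it has been uploaded.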

If the EC2 instance is in the same region as the Amazon S3 bucket, there is no Data Transfer charge.

Archived:	6 years, 8 months ago
Views:	11902
Last activity:	4 years, 11 months ago