Etag definition changed in Amazon S3

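@DavidG Scan the feasible range of chunk sizes. It is not foolproof, but I have found that Amazon Import/Export and the 'aws s3' command-line tool use whole-GB chunks for files from a few GB up to a few hundred GB (possibly always). The part of the etag after the hyphen tells you the number of chunks. That usually leaves only a small number of chunk sizes that could have produced that many chunks. You can iterate over them to see whether one matches the etag you were given. I use Antonio's script for each iteration. (2 upvotes)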
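
The scan described in this comment could be sketched in Python roughly as follows. This is only an illustration: the names `multipart_etag` and `guess_part_size`, the `unit` granularity (1 MiB by default) and the `max_units` bound are assumptions made for the example and are not taken from the comment or from Antonio's script.

```python
import hashlib


def multipart_etag(path, part_size):
    """Return the S3-style multipart ETag of a local file for one part size."""
    digests = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(part_size), b""):
            digests.append(hashlib.md5(block).digest())
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return "{}-{}".format(combined, len(digests))


def guess_part_size(path, etag, file_size, unit=1024 * 1024, max_units=1024):
    """Try part sizes that are whole multiples of `unit` (an assumed granularity).

    Returns the first part size in bytes whose ETag matches, or None.
    """
    # The number after the hyphen is the part count.
    part_count = int(etag.rsplit("-", 1)[1])
    for n in range(1, max_units + 1):
        part_size = n * unit
        # Skip sizes that cannot give that part count: ceil(file_size / part_size).
        if -(-file_size // part_size) != part_count:
            continue
        if multipart_etag(path, part_size) == etag:
            return part_size
    return None
```

Filtering on the part count first means the file is only re-hashed for the few candidate sizes that could have produced that many parts.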

Answer 3

Spe*_*dge 21

Also in Python ...

import binascii
import hashlib
import os

# Max size in bytes before uploading in parts.
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts.
AWS_UPLOAD_PART_SIZE = 6 * 1024 * 1024

#
# Function : md5sum
# Purpose : Compute, for a local file, the hash that S3 reports as its ETag
# Returns : Returns the md5 hash that will match the ETag in S3
def md5sum(sourcePath):

    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()

    if filesize > AWS_UPLOAD_MAX_SIZE:

        # Multipart case: hash each part, then hash the concatenated raw digests.
        block_count = 0
        md5string = b""
        with open(sourcePath, "rb") as f:
            # b"" (not "") is the end-of-file sentinel for a file opened in binary mode.
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash = hashlib.md5()
                hash.update(block)
                md5string = md5string + binascii.unhexlify(hash.hexdigest())
                block_count += 1

        hash = hashlib.md5()
        hash.update(md5string)
        return hash.hexdigest() + "-" + str(block_count)

    else:
        # Single-part case: the ETag is the plain MD5 of the whole file.
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash.update(block)
        return hash.hexdigest()


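This worked for me with `AWS_UPLOAD_PART_SIZE = 8*1024*1024`; for Python 3 the empty-string sentinels need to be `b""`. (4 upvotes)
The open mode should be "rb" rather than "r+b" so that read-only files can be handled. (2 upvotes)
Is there a reason you use "md5string = md5string + binascii.unhexlify(hash.hexdigest())" rather than "md5string = md5string + hash.digest()"? (2 upvotes)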
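
As a small check on the question in the last comment: `hexdigest()` is simply the hex encoding of `digest()`, so the two spellings produce the same bytes. The sample bytes below are hypothetical.

```python
import binascii
import hashlib

block = b"example part data"  # hypothetical sample bytes, not from the answer
h = hashlib.md5(block)

# Decoding the hex string gives back the raw 16-byte digest.
same = binascii.unhexlify(h.hexdigest()) == h.digest()
print(same)
```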

Answer 4

sea*_*boy 6

Here is a PowerShell function to compute the Amazon ETag for a file:

$blocksize = (1024*1024*5)
$startblocks = (1024*1024*16)
function AmazonEtagHashForFile($filename) {
    $lines = 0
    [byte[]] $binHash = @()

    $md5 = [Security.Cryptography.HashAlgorithm]::Create("MD5")
    $reader = [System.IO.File]::Open($filename,"OPEN","READ")

    if ((Get-Item $filename).length -gt $startblocks) {
        $buf = new-object byte[] $blocksize
        while (($read_len = $reader.Read($buf,0,$buf.length)) -ne 0){
            $lines   += 1
            $binHash += $md5.ComputeHash($buf,0,$read_len)
        }
        $binHash=$md5.ComputeHash( $binHash )
    }
    else {
        $lines   = 1
        $binHash += $md5.ComputeHash($reader)
    }

    $reader.Close()

    $hash = [System.BitConverter]::ToString( $binHash )
    $hash = $hash.Replace("-","").ToLower()

    if ($lines -gt 1) {
        $hash = $hash + "-$lines"
    }

    return $hash
}


Answer 5

r03*_*r03 6

Here is an example in Go:

func GetEtag(path string, partSizeMb int) string {
    partSize := partSizeMb * 1024 * 1024
    content, _ := ioutil.ReadFile(path)
    size := len(content)
    contentToHash := content
    parts := 0

    if size > partSize {
        pos := 0
        contentToHash = make([]byte, 0)
        for size > pos {
            endpos := pos + partSize
            if endpos >= size {
                endpos = size
            }
            hash := md5.Sum(content[pos:endpos])
            contentToHash = append(contentToHash, hash[:]...)
            pos += partSize
            parts += 1
        }
    }

    hash := md5.Sum(contentToHash)
    etag := fmt.Sprintf("%x", hash)
    if parts > 0 {
        etag += fmt.Sprintf("-%d", parts)
    }
    return etag
}


This is only an example; you should handle errors and the like.

Answer 6

hrr*_*hrr 5

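If you use multipart upload, the "etag" is not the MD5 sum of the data (see "What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?"). You can recognize this case by the etag containing a dash "-".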
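
The dash check can be sketched in Python as below. `parse_etag` is a hypothetical helper name introduced for this illustration, and stripping the double quotes assumes the value is taken as-is from the HTTP `ETag` header.

```python
def parse_etag(etag):
    """Split an S3 ETag into (hex_digest, part_count).

    part_count is None when the ETag has no "-N" suffix.
    """
    value = etag.strip('"')  # the ETag header value is wrapped in double quotes
    digest, sep, parts = value.partition("-")
    if sep and parts.isdigit():
        return digest, int(parts)
    return value, None
```

A `part_count` other than `None` signals a multipart upload, so the hex part is not the MD5 of the data.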

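Now, the interesting question is how to get the actual MD5 sum of the data without downloading it. One simple way is to "copy" the object onto itself, which requires no download: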

s3cmd cp s3://bucket/key s3://bucket/key

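This causes S3 to recompute the MD5 sum and store it as the "etag" of the object that was just copied. The "copy" command runs directly on S3, i.e. no object data is transferred to or from S3, so it needs very little bandwidth! (Note: do not use s3cmd mv; that would delete your data.)

The underlying REST command is: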

PUT /key HTTP/1.1
Host: bucket.s3.amazonaws.com
x-amz-copy-source: /bucket/key
x-amz-metadata-directive: COPY


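Yes, this "copy" is performed on the S3 servers (I have updated the answer to mention this explicitly), which is why I find it so useful for computing the MD5 sum. (2 upvotes)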

Archived:	14 years, 7 months ago
Views:	47579
Last activity:	6 years, 11 months ago