Getting a non-ASCII file name from an S3 notification event in Lambda

Ala*_*ack 5 utf-8 amazon-s3 python-2.7 python-unicode aws-lambda

The key field of an AWS S3 notification event, which holds the file name, is URL-escaped.

This is evident when the file name contains spaces or non-ASCII characters.

For example, I uploaded a file with the following name to S3:

my file řěąλλυ.txt

The notification received is:

{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}

I tried decoding it with:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

But that yields:

my file ÅÄÄλλÏ.txt

Naturally, when I then try to fetch the file from S3 with Boto, I get a 404 error.

Ala*_*ack 9

TL;DR

You need to convert the URL-encoded Unicode string to a byte str before unquoting it and decoding it as UTF-8.

For example, for an S3 object with the file name my file řěąλλυ.txt:

>>> import urllib

# Encode the Unicode key to a UTF-8 [byte] str. The key should not contain
# any non-ASCII characters at this point, but UTF-8 is the safer choice.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

# Unquote: the URL-escaped sequences are now converted to raw UTF-8 bytes.
# Had you passed a Unicode object to unquote_plus, you would have got a
# Unicode string whose code points are the individual UTF-8 bytes!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decode key_utf8 to a Unicode string. Note the u prefix on the result:
# the UTF-8 bytes have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt
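
On Python 3 the byte-string round trip should not be needed, because urllib.parse.unquote_plus percent-decodes as UTF-8 by default. A minimal sketch, assuming Python 3; the helper name decode_s3_key and the trimmed sample event are illustrative and not part of the original answer:

```python
from urllib.parse import unquote_plus


def decode_s3_key(event, index=0):
    """Return the decoded object key of one record in an S3 notification event.

    decode_s3_key is a hypothetical helper name used for illustration.
    """
    raw_key = event['Records'][index]['s3']['object']['key']
    # unquote_plus turns '+' into a space and decodes %XX escapes as UTF-8
    # (its default encoding), so no manual encode/decode step is needed.
    return unquote_plus(raw_key)


# Sample event trimmed to the fields used here (an assumption for the demo).
sample_event = {
    'Records': [
        {'s3': {'object': {'key': 'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'}}}
    ]
}

key = decode_s3_key(sample_event)
print(key)
```

The decoded key can then be passed on to whichever S3 client call fetches the object.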

Background

AWS has committed the cardinal sin of changing the default encoding: https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should have got from decode() is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

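key is Unicode. In Python 2.x you can call decode() on a Unicode object even though it makes no sense. When you do, Python first tries to encode the object to a [byte] str and then decodes that with the given encoding. In Python 2.x the default encoding should be ASCII, which of course cannot represent the characters used here.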

Had you got the proper UnicodeEncodeError from Python, you would probably have found a suitable answer. On Python 3, you cannot call .decode() on a str at all.
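
The failure can be reproduced on Python 3 by spelling out the two steps that Python 2 performs implicitly. This is only a sketch under assumptions: it uses Python 3, percent-decoding with encoding='latin-1' stands in for what Python 2's unquote_plus does to a Unicode argument (each escaped byte becomes one code point), and the variable names are illustrative.

```python
from urllib.parse import unquote_plus

raw_key = 'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

# Stand-in for Python 2's unquote_plus on a Unicode object: every escaped
# byte becomes a single code point instead of part of a UTF-8 sequence.
mangled = unquote_plus(raw_key, encoding='latin-1')

# Python 2's unicode.decode('utf-8') first encodes with the default codec.
# With the proper ASCII default, that encode step fails:
error = None
try:
    mangled.encode('ascii').decode('utf-8')
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Treating the code points as bytes again recovers the real key.
repaired = mangled.encode('latin-1').decode('utf-8')
print(repaired)
```

The last two lines also show a way to repair a key that has already been mangled this way, although decoding it correctly in the first place is the cleaner route.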


Mar*_*thy 6

Just in case anyone else lands here hoping for a JavaScript solution, here is what I ended up with:

function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}

With limited testing, it seems to work fine:

  • test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
  • my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
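
One detail about the order of operations: the '+' characters are replaced with spaces before percent-decoding, not after. Python's unquote_plus works in the same order. A small Python 3 sketch of why that matters, assuming a literal plus sign in a file name arrives escaped as %2B (the key a%2Bb+c.txt is made up for illustration):

```python
from urllib.parse import unquote, unquote_plus

# Assumed sample key: a space (sent as '+') and a literal plus (sent as %2B).
sample = 'a%2Bb+c.txt'

# Correct order: '+' becomes a space first, then %XX escapes are decoded.
right = unquote_plus(sample)

# Wrong order, shown for contrast: decoding first turns %2B into '+',
# which the later replacement then mistakes for an encoded space.
wrong = unquote(sample).replace('+', ' ')

print(right)  # a+b c.txt
print(wrong)  # a b c.txt

# The first test key from the list above, decoded the same way.
jp_key = unquote_plus('test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg')
print(jp_key)
```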