Truncating international strings in Python

Question

Sno*_*man 6 python string encoding apple-push-notifications

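I've been trying to debug this for far too long, and I clearly don't know what I'm doing, so hopefully someone can help. I'm not even sure what I should be asking, but here goes:

I'm trying to send Apple push notifications, and their payload size is limited to 256 bytes. After subtracting some overhead, that leaves me about 100 English characters for the main message content.

So if the message is longer than the maximum, I truncate it: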

MAX_PUSH_LENGTH = 100
body = (body[:MAX_PUSH_LENGTH]) if len(body) > MAX_PUSH_LENGTH else body

This works fine and dandy: no matter how long the (English) message is, the push notification is sent successfully. But now I have an Arabic string:

str = "??? ????? 
??? ????? ??? ??? ??? ??? ????? 
??? ????? ??? ??? ??? 
???? ?"

>>> print len(str)
109

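So this should be truncated. But I always get an invalid payload size error! Curious, I kept lowering the MAX_PUSH_LENGTH threshold to see what it would take to succeed, and the push notification didn't go through until I set the limit to about 60.

I'm not sure whether this has to do with the byte size of languages other than English. My understanding is that an English character takes one byte; does an Arabic character take 2 bytes? Could that have something to do with it?

Also, the string is JSON-encoded before it is sent, so it ends up looking like this: \u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646 \n\u0639\u064a\u0634 ... Is it being interpreted as a raw string, so that u0647 alone is 5 bytes?

What should I do? Is there an obvious mistake, or am I not asking the right question?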
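
To make the byte-size question concrete, here is a small check of the three lengths involved. It is only a sketch in Python 3 syntax (the code in this thread is Python 2), and the sample text is just the first two words of the escaped string quoted above. Whether your push library lets you pass `ensure_ascii=False` through to the JSON encoder is an assumption you would need to verify.

```python
import json

# First two words of the message, written with the escapes quoted in the question.
text = "\u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646"

chars = len(text)                       # number of code points
utf8_bytes = len(text.encode("utf-8"))  # size when sent as raw UTF-8

# Default json.dumps escapes every non-ASCII character as \uXXXX (6 ASCII bytes each).
escaped = len(json.dumps(text))

# With ensure_ascii=False the characters stay as UTF-8 (2 bytes each for Arabic letters).
raw_json = len(json.dumps(text, ensure_ascii=False).encode("utf-8"))

print(chars, utf8_bytes, escaped, raw_json)  # 9 17 51 19
```

So each Arabic letter costs 2 bytes in UTF-8 but 6 bytes once it is escaped as `\uXXXX`, which is why a limit counted in characters overshoots the payload size.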

Answer 1

Nic*_*ick 10

If you have a Python unicode value and want to truncate it, the following is a very short, general, and efficient way to do it in Python.

def truncate_unicode_to_byte_limit(src, byte_limit, encoding='utf-8'):
    '''
    truncate a unicode value to fit within byte_limit when encoded in encoding

    src: a unicode
    byte_limit: a non-negative integer
    encoding: a text encoding

    returns a unicode prefix of src guaranteed to fit within byte_limit when
    encoded as encoding.
    '''
    return src.encode(encoding)[:byte_limit].decode(encoding, 'ignore')

For example:

s = u"""
    ??? ?????
    ascii
    ??? ????? ??? ??? ??? ??? ?????
    ??? ????? ??? ??? ???
    ???? ?
"""

b = truncate_unicode_to_byte_limit(s, 73)
print len(b.encode('utf-8')), b

produces the output:

73 
    ??? ?????
    ascii
    ??? ????? ??? ??? ??
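
For readers on Python 3, where `str` is already Unicode, the same encode, slice, decode idea might look like the sketch below. The function name `truncate_to_byte_limit` and the sample text (the first two words of the string escaped in the question) are illustrative choices, not part of the answer above.

```python
def truncate_to_byte_limit(text: str, byte_limit: int, encoding: str = "utf-8") -> str:
    """Return a prefix of text whose encoded form fits within byte_limit bytes."""
    # Slicing the encoded bytes may split the last character in two;
    # errors="ignore" drops that incomplete trailing sequence on decode.
    return text.encode(encoding)[:byte_limit].decode(encoding, errors="ignore")


sample = "\u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646"  # 17 bytes in UTF-8
short = truncate_to_byte_limit(sample, 10)

# Byte 10 falls in the middle of a 2-byte letter, so only 9 bytes survive.
print(len(short), len(short.encode("utf-8")))  # 5 9
```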

Answer 2

900*_*000 1

You need to cut to a byte length, so you first need to .encode('utf-8') the string and then cut it at a code point boundary.

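In UTF-8, ASCII (<= 127) takes 1 byte. A byte with its two or more most significant bits set (>= 192) is a character start byte; the number of leading set bits gives the total length of the sequence, so a byte of the form 110xxxxx starts a 2-byte character. Anything else is a continuation byte.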
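
As a rough illustration of those three categories, here is a Python 3 sketch that labels each byte of a short encoded string. The helper name `utf8_byte_kind` is made up for this example, and it assumes the input is valid UTF-8.

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify one byte of valid UTF-8 as 'ascii', 'continuation', or 'lead'."""
    if byte < 0x80:   # 0xxxxxxx
        return "ascii"
    if byte < 0xC0:   # 10xxxxxx
        return "continuation"
    return "lead"     # 11xxxxxx: first byte of a multi-byte character


data = "a\u0647".encode("utf-8")  # b'a\xd9\x87'
kinds = [utf8_byte_kind(b) for b in data]
print(kinds)  # ['ascii', 'lead', 'continuation']
```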

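Cutting in the middle of a multi-byte sequence can cause problems; if a character doesn't fit, it should be cut off entirely, back to and including its start byte.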
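
One way to do that backing up, as an alternative sketch to walking forward character by character: cut at the limit, then step backwards while the first dropped byte is a continuation byte. This is Python 3, the name `cut_utf8` is hypothetical, and it assumes valid UTF-8 input and a non-negative limit.

```python
def cut_utf8(data: bytes, byte_limit: int) -> bytes:
    """Cut valid UTF-8 bytes to at most byte_limit bytes without splitting a character."""
    if len(data) <= byte_limit:
        return data
    cut = byte_limit
    # If the first byte we would drop is a continuation byte (10xxxxxx),
    # the cut falls inside a character: back up to that character's start byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


encoded = "\u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646".encode("utf-8")  # 17 bytes
print(len(cut_utf8(encoded, 10)))  # 9
print(cut_utf8(encoded, 10).decode("utf-8"))
```

Since a UTF-8 character is at most 4 bytes long, the loop runs at most 3 times for valid input.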

Here is some working code:

LENGTH_BY_PREFIX = [
    # (first byte mask, total codepoint length)
    # Longest prefix first, so the most specific mask matches before the
    # shorter ones (0xE0 would otherwise match 0xC0 and be counted as 2).
    (0xFC, 6),
    (0xF8, 5),
    (0xF0, 4),
    (0xE0, 3),
    (0xC0, 2),
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1  # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if (first_byte & mask) == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        # Python 2: indexing a byte string yields a 1-char str, hence ord()
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes

Now test it. If .decode() succeeds, we made a correct cut.

unicode_text = u"\u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646"  # note that the literal here is Unicode

print cut_to_bytes_length(unicode_text, 100).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 10).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 5).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 4).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 3).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 2).decode('UTF-8')

# This prints an empty string, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print cut_to_bytes_length(unicode_text, 1).decode('UTF-8')

You can test that the code works for ASCII as well.

Archived:	13 years, 6 months ago
Views:	2,201
Last activity:	11 years, 11 months ago