Using Python textwrap.shorten for string but with bytes width

Acu*_*nus 6 python word-wrap

I'd like to shorten a string using textwrap.shorten or a function like it. The string can potentially have non-ASCII characters. What's special here is that the maximum width applies to the encoded bytes of the string, not to its characters. This problem is motivated by the fact that several database column definitions and some message buses have a byte-based maximum length.

For example:

>>> import textwrap
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'

# Available function that I tried:
>>> textwrap.shorten(s, width=27)
'☺ Ilsa, le méchant ☺ [...]'
>>> len(_.encode())
31  # I want ≤27

# Desired function:
>>> shorten_to_bytes_width(s, width=27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27  # I want and get ≤27

It's okay for the implementation to require a width greater than or equal to the length of the whitespace-stripped placeholder [...], i.e. 5.

The text should not be shortened any more than necessary. Some buggy implementations can use optimizations which on occasion result in excessive shortening.

Using textwrap.wrap with bytes count is a similar question but it's different enough from this one since it is about textwrap.wrap, not textwrap.shorten. Only the latter function uses a placeholder ([...]) which makes this question sufficiently unique.

Caution: Do not rely on any of the answers here for shortening a JSON encoded string to a fixed number of bytes. For that case, substitute text.encode() with json.dumps(text) when measuring the length.
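To illustrate the caution: the JSON form of a string can be longer than its UTF-8 encoding, because json.dumps adds the surrounding quotes and, with its default ensure_ascii=True, escapes non-ASCII characters. A minimal sketch; the helper name json_width is hypothetical and not part of any answer here.

```python
import json

def json_width(text: str) -> int:
    """Bytes occupied by the JSON-encoded form of text (hypothetical helper)."""
    return len(json.dumps(text).encode())

text = '☺ Ilsa'
utf8_width = len(text.encode())   # '☺' takes 3 bytes in UTF-8
encoded_width = json_width(text)  # '☺' becomes the 6-character escape \u263a, plus 2 quotes
print(utf8_width, encoded_width)
```

A byte limit that is checked against text.encode() would therefore underestimate the size of the JSON payload.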

MSe*_*ert 3

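In theory it is sufficient to encode your string and then check whether it fits within the "width" constraint. If it does, the string can simply be returned. Otherwise you can take the first "width" bytes from the encoded string (minus the bytes needed for the placeholder). To make it behave like textwrap.shorten, you also need to find the last whitespace in the remaining bytes and return everything before that whitespace plus the placeholder. If there is no whitespace, only the placeholder needs to be returned.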
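One detail worth illustrating: slicing the encoded bytes can cut through a multi-byte character, but splitting at the last space discards the damaged tail, because the byte 0x20 never occurs inside a multi-byte UTF-8 sequence. A small sketch of that step alone, assuming UTF-8 (the default for str.encode()); the variable names are only for illustration.

```python
encoded = 'le méchant'.encode()      # 'é' is the two bytes b'\xc3\xa9'
cut = encoded[:5]                    # b'le m\xc3': the slice ends in the middle of 'é'

try:
    cut.decode()
    decodable = True
except UnicodeDecodeError:
    decodable = False                # the raw slice is not valid UTF-8

head = cut.rsplit(b' ', 1)[0]        # keep only what precedes the last space
print(decodable, head.decode())      # the part before the space decodes cleanly
```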


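Given that you mentioned you really want it byte-constrained, the function throws an exception if the placeholder is too large. Having a placeholder that does not fit into the byte-constrained container or data structure simply makes no sense, and raising avoids a lot of edge cases in which "maximum byte size" and "placeholder byte size" would be inconsistent.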
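For comparison, textwrap.shorten itself appears to behave the same way: in CPython it raises a ValueError when the width is smaller than the whitespace-stripped placeholder. A quick check, assuming a current CPython; the variable raised exists only for this illustration.

```python
import textwrap

try:
    # The default placeholder ' [...]' needs 5 characters once stripped.
    textwrap.shorten('Ilsa, le méchant gardien', width=4)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```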


The code could look like this:

def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
    # Make sure the placeholder satisfies the byte length requirement
    encoded_placeholder = placeholder.encode().strip()
    if maximum_bytes < len(encoded_placeholder):
        raise ValueError('placeholder too large for max width')

    # Get the UTF-8 bytes that represent the string and (optionally) normalize the spaces.
    if normalize_spaces:
        string = " ".join(string.split())
    encoded_string = string.encode()

    # If the input string is empty simply return an empty string.
    if not encoded_string:
        return ''

    # In case we don't need to shorten anything simply return
    if len(encoded_string) <= maximum_bytes:
        return string

    # We need to shorten the string, so we need to add the placeholder
    substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
    splitted = substring.rsplit(b' ', 1)  # Split at last space-character
    if len(splitted) == 2:
        return b" ".join([splitted[0], encoded_placeholder]).decode()
    else:
        # No space found: return only the placeholder (honours a custom placeholder)
        return encoded_placeholder.decode()

And a simple test case:

t = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'

for i in range(5, 50):
    shortened = shorten_rsplit(t, i)
    byte_length = len(shortened.encode())
    print(byte_length <= i, i, byte_length, shortened)

Which returns:

True 5 5 [...]
True 6 5 [...]
True 7 5 [...]
True 8 5 [...]
True 9 9 ☺ [...]
True 10 9 ☺ [...]
True 11 9 ☺ [...]
True 12 9 ☺ [...]
True 13 9 ☺ [...]
True 14 9 ☺ [...]
True 15 15 ☺ Ilsa, [...]
True 16 15 ☺ Ilsa, [...]
True 17 15 ☺ Ilsa, [...]
True 18 18 ☺ Ilsa, le [...]
True 19 18 ☺ Ilsa, le [...]
True 20 18 ☺ Ilsa, le [...]
True 21 18 ☺ Ilsa, le [...]
True 22 18 ☺ Ilsa, le [...]
True 23 18 ☺ Ilsa, le [...]
True 24 18 ☺ Ilsa, le [...]
True 25 18 ☺ Ilsa, le [...]
True 26 18 ☺ Ilsa, le [...]
True 27 27 ☺ Ilsa, le méchant [...]
True 28 27 ☺ Ilsa, le méchant [...]
True 29 27 ☺ Ilsa, le méchant [...]
True 30 27 ☺ Ilsa, le méchant [...]
True 31 31 ☺ Ilsa, le méchant ☺ [...]
True 32 31 ☺ Ilsa, le méchant ☺ [...]
True 33 31 ☺ Ilsa, le méchant ☺ [...]
True 34 31 ☺ Ilsa, le méchant ☺ [...]
True 35 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 36 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 37 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 38 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 39 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 40 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 41 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 42 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 43 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 44 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 45 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 46 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 47 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 48 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 49 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺

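The function also has an argument for normalizing the whitespace. That can be helpful if you have different kinds of whitespace (newlines, etc.) or multiple consecutive spaces, although it will be a bit slower.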
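As a small illustration of what that normalization does: str.split() without arguments splits on any run of whitespace, so joining the pieces with a single space collapses newlines, tabs and repeated spaces. The sample string is made up for this sketch.

```python
messy = 'Ilsa,\n le   méchant\tgardien '
normalized = ' '.join(messy.split())  # same expression the function uses
print(repr(normalized))
```

This matters because the function only breaks at the byte b' ', so a newline or tab is not treated as a break point unless the text has been normalized first.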

Performance

I did a quick test using simple_benchmark (a library I wrote) to make sure it is actually faster.

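For the benchmark I create a string of random Unicode characters in which (on average) one out of eight characters is a space. I also use half the length of the string as the byte width to shorten to. There is no special reason for either choice, but they could bias the benchmark, which is why I wanted to mention them.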

The functions used in the benchmark:

def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
    encoded_placeholder = placeholder.encode().strip()
    if maximum_bytes < len(encoded_placeholder):
        raise ValueError('placeholder too large for max width')
    if normalize_spaces:
        string = " ".join(string.split())
    encoded_string = string.encode()
    if not encoded_string:
        return ''
    if len(encoded_string) <= maximum_bytes:
        return string
    substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
    splitted = substring.rsplit(b' ', 1)  # Split at last space-character
    if len(splitted) == 2:
        return b" ".join([splitted[0], encoded_placeholder]).decode()
    else:
        # No space found: return only the placeholder (honours a custom placeholder)
        return encoded_placeholder.decode()

import textwrap

_MIN_WIDTH = 5
def shorten_to_bytes_width(text: str, width: int) -> str:
    width = max(_MIN_WIDTH, width)
    text = textwrap.shorten(text, width)
    while len(text.encode()) > width:
        text = textwrap.shorten(text, len(text) - 1)
    assert len(text.encode()) <= width
    return text

def naive(text: str, width: int) -> str:
    width = max(_MIN_WIDTH, width)
    text = textwrap.shorten(text, width)
    if len(text.encode()) <= width:
        return text

    current_width = _MIN_WIDTH
    index = 0
    slice_index = 0
    endings = ' '
    while True:
        new_width = current_width + len(text[index].encode())
        if new_width > width:
            break
        if text[index] in endings:
            slice_index = index
        index += 1
        current_width = new_width
    if slice_index:
        slice_index += 1  # to include found space
    text = text[:slice_index] + '[...]'
    assert len(text.encode()) <= width
    return text


MAX_BYTES_PER_CHAR = 4
def bytes_to_char_length(input, bytes, start=0, max_length=None):
    if bytes <= 0 or (max_length is not None and max_length <= 0):
        return 0
    if max_length is None:
        max_length = min(bytes, len(input) - start)
    bytes_too_much = len(input[start:start + max_length].encode()) - bytes
    if bytes_too_much <= 0:
        return max_length
    min_length = max(max_length - bytes_too_much, bytes // MAX_BYTES_PER_CHAR)
    max_length -= (bytes_too_much + MAX_BYTES_PER_CHAR - 1) // MAX_BYTES_PER_CHAR
    new_start = start + min_length
    bytes_left = bytes - len(input[start:new_start].encode())
    return min_length + bytes_to_char_length(input, bytes_left, new_start, max_length - min_length)


def shorten_to_bytes(input, bytes, placeholder=' [...]', start=0):
    if len(input[start:start + bytes + 1].encode()) <= bytes:
        return input
    bytes -= len(placeholder.encode())
    max_chars = bytes_to_char_length(input, bytes, start)
    if max_chars <= 0:
        return placeholder.strip() if bytes >= 0 else ''
    w = input.rfind(' ', start, start + max_chars + 1)
    if w > 0:
        return input[start:w] + placeholder
    else:
        return input[start:start + max_chars] + placeholder

# Benchmark

from simple_benchmark import benchmark, MultiArgument

import random

def get_random_unicode(length):  # https://stackoverflow.com/a/21666621/5393381
    get_char = chr
    include_ranges = [
        (0x0021, 0x0021), (0x0023, 0x0026), (0x0028, 0x007E), (0x00A1, 0x00AC), (0x00AE, 0x00FF),
        (0x0100, 0x017F), (0x0180, 0x024F), (0x2C60, 0x2C7F), (0x16A0, 0x16F0), (0x0370, 0x0377),
        (0x037A, 0x037E), (0x0384, 0x038A), (0x038C, 0x038C)
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    # Add more whitespaces
    for _ in range(len(alphabet) // 8):
        alphabet.append(' ')
    return ''.join(random.choice(alphabet) for i in range(length))

r = benchmark(
    [shorten_rsplit, shorten_to_bytes, shorten_to_bytes_width, naive, bytes_to_char_length],
    {2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 15)},
    "string length"
)

I also did a second benchmark excluding the shorten_to_bytes_width function, so that I could benchmark even longer strings:

r = benchmark(
    [shorten_rsplit, shorten_to_bytes, naive],
    {2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 20)},
    "string length"
)
