Using textwrap.dedent() with bytes in Python 3

nom*_*ype 5 python indentation literals python-3.x python-unicode

When I use triple-quoted multiline strings in Python, I tend to use textwrap.dedent to keep the code readable and well indented:

some_string = textwrap.dedent("""
    First line
    Second line
    ...
    """).strip()

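However, in Python 3.x, textwrap.dedent does not seem to work with byte strings. I ran into this while writing a unit test for a method that returns a long multiline byte string, for example: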

# The function to be tested

def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'

# Unit test

import unittest
import textwrap

class SomeTest(unittest.TestCase):
    def test_some_function(self):
        self.assertEqual(some_function(), textwrap.dedent(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).strip())

if __name__ == '__main__':
    unittest.main()

The code above works in Python 2.7.10 but fails in Python 3.4.3 with a TypeError, because textwrap.dedent applies str regular-expression patterns to the bytes argument.

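So: is there an alternative to textwrap.dedent that works with byte strings?

  • I could write such a function myself, but if one already exists I would rather use it.
  • I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.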
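The second option is less restrictive than it may look. A minimal sketch, assuming Python 3: Latin-1 maps every byte value from 0 to 255 to exactly one code point, so decoding with it and encoding back never fails, whatever the bytes contain. The name `dedent_via_latin1` is a hypothetical helper, not part of textwrap.

```python
import textwrap


def dedent_via_latin1(data):
    """Hypothetical helper: dedent bytes by round-tripping through Latin-1."""
    # Latin-1 decodes any byte sequence, and encoding the result restores
    # the original byte values, so no real text encoding is assumed here.
    return textwrap.dedent(data.decode("latin-1")).encode("latin-1")


expected = dedent_via_latin1(b"""
    Lorem ipsum dolor sit amet
      consectetuer adipiscing elit
    """).strip()

print(expected)
```

Treat this as a sketch for text-like payloads such as the test fixture above rather than for arbitrary binary data: textwrap.dedent still applies its own rules for whitespace-only lines to the decoded text.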

Ter*_*edy 5

Answer 2: textwrap is primarily about the TextWrapper class and functions. dedent is listed under:

# -- Loosely related functionality --------------------

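As far as I can tell, the only thing that makes it specific to text (unicode str) is the re literals. I prefixed all 6 with b, and voilà! (I did not edit anything else, but the function docstring should be adjusted.)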

import re

_whitespace_only_re = re.compile(b'^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile(b'(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent_bytes(text):
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\\thello" are
    considered to have no common leading whitespace.  (This behaviour is
    new in Python 2.5; older versions of this module incorrectly
    expanded tabs before searching for common leading whitespace.)
    """
    # Look for the longest leading string of spaces and tabs common to
    # all lines.
    margin = None
    text = _whitespace_only_re.sub(b'', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent

        # Current line more deeply indented than previous winner:
        # no change (previous winner is still on top).
        elif indent.startswith(margin):
            pass

        # Current line consistent with and no deeper than previous winner:
        # it's the new winner.
        elif margin.startswith(indent):
            margin = indent

        # Find the largest common whitespace between current line
        # and previous winner.
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
            else:
                margin = margin[:len(indent)]

    # sanity check (testing/debugging only)
    if 0 and margin:
        for line in text.split(b"\n"):
            assert not line or line.startswith(margin), \
                   "line = %r, margin = %r" % (line, margin)

    if margin:
        text = re.sub(rb'(?m)^' + margin, b'', text)
    return text

print(dedent_bytes(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """))

# prints
# b'\nLorem ipsum dolor sit amet\n  consectetuer adipiscing elit\n'
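Why the b prefixes matter can be seen in isolation. The sketch below compiles the same whitespace-only pattern once from a str literal and once from a bytes literal, then applies both to a bytes value; the variable names are illustrative only.

```python
import re

# The same pattern compiled twice: once from str, once from bytes.
str_pattern = re.compile(r'^[ \t]+$', re.MULTILINE)
bytes_pattern = re.compile(rb'^[ \t]+$', re.MULTILINE)

data = b'first\n   \nsecond'

try:
    # A str pattern cannot be applied to a bytes object.
    str_pattern.sub('', data)
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'

# A bytes pattern with a bytes replacement works on bytes input.
cleaned = bytes_pattern.sub(b'', data)

print(outcome, cleaned)
```

The re module requires the pattern, the replacement, and the subject to all be str or all be bytes, which is why every literal in the function has to change together.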


wim*_*wim 3

Unfortunately, dedent does not seem to support bytestrings. However, if you want cross-compatible code, I recommend using the six library:

import unittest
from textwrap import dedent

import six


def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'


class SomeTest(unittest.TestCase):
    def test_some_function(self):
        actual = some_function()

        expected = six.b(dedent("""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)).strip()

        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()

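This is similar to the bullet-point suggestion in your question:

I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.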

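But you have misunderstood something about encodings here. If you can write the string literal in the test like that in the first place, and Python parses the file successfully (i.e. the correct encoding declaration is on the module), then there is no "convert to unicode" step. The file is parsed in the specified encoding (or in the default, sys.getdefaultencoding(), if you did not specify one), and by the time the string is a Python variable it has already been decoded.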
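To illustrate the point for Python 3-only code: the literal below is already a str once the module has been parsed, so the only conversion left is a single encode call. This sketch uses `str.encode("latin-1")`, which is what `six.b` does on Python 3. It assumes the literal contains only Latin-1 characters; otherwise another codec, such as UTF-8, would be needed.

```python
from textwrap import dedent


def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'


# The literal is a str: it was decoded when the source file was parsed.
text = dedent("""
    Lorem ipsum dolor sit amet
      consectetuer adipiscing elit
    """).strip()

# One explicit encode step produces the bytes to compare against.
expected = text.encode("latin-1")

assert isinstance(text, str)
assert some_function() == expected
```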