Hooking into the built-in Python f-string format machinery

Question

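Summary

I really like f-strings. They're bloody awesome syntax.

I have written a small library (described below) for my own use to take further advantage of them (and I'm putting it here for others in case it helps them). A simple example: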

>>> import simpleformatter as sf
>>> def format_camel_case(string):
...     """camel cases a sentence"""
...     return ''.join(s.capitalize() for s in string.split())
...
>>> @sf.formattable(camcase=format_camel_case)
... class MyStr(str): ...
...
>>> f'{MyStr("lime cordial delicious"):camcase}'
'LimeCordialDelicious'

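To simplify the API and extend its use to instances of built-in classes, it would be very useful to find a way to hook into the built-in Python formatting machinery, which would allow custom format specifications for built-ins: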

>>> f'{"lime cordial delicious":camcase}'
'LimeCordialDelicious'

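In other words, I'd like to override the built-in format function (which is used by the f-string syntax), or alternatively extend the built-in __format__ methods of existing standard library classes, so that I can write something like this: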

for x, y, z in complicated_generator:
    eat_string(f"x: {x:custom_spec1}, y: {y:custom_spec2}, z: {z:custom_spec3}")

I have accomplished this by creating subclasses with their own __format__ methods, but of course this will not work for built-in classes.

I can get close to it using the string.Formatter API:

my_formatter = MyFormatter()  # custom string.Formatter instance

format_str = "x: {x:custom_spec1}, y: {y:custom_spec2}, z: {z:custom_spec3}"

for x, y, z in complicated_generator:
    eat_string(my_formatter.format(format_str, **locals()))

I find this a tad clunky, and definitely not as readable as the f-string API.
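For reference, the MyFormatter class used above is never shown. A minimal sketch of what such a class might look like follows, assuming it overrides string.Formatter.format_field; the spec names camcase and shout and the custom_specs table are made up for illustration and are not taken from simpleformatter.

```python
import string


class MyFormatter(string.Formatter):
    """Formatter that handles a few custom specs and defers everything else."""

    # Hypothetical spec table; these names exist only for this example.
    custom_specs = {
        "camcase": lambda value: "".join(w.capitalize() for w in str(value).split()),
        "shout": lambda value: str(value).upper() + "!",
    }

    def format_field(self, value, format_spec):
        # format_field() is called once per replacement field.
        handler = self.custom_specs.get(format_spec)
        if handler is not None:
            return handler(value)
        # Unknown specs fall through to the normal format() behavior.
        return super().format_field(value, format_spec)


my_formatter = MyFormatter()
result = my_formatter.format(
    "{x:camcase} / {y:shout} / {z:>4}", x="lime cordial", y="hi", z=7
)
```

Standard specs such as >4 keep working because unknown specs are passed on to the base class.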

Another thing that could be done is overriding builtins.format, but this doesn't work for f-strings: once builtins.format has been replaced, a direct call to format() uses the replacement, while an f-string with a custom spec on a built-in value still raises ValueError (invalid format specifier).
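The difference between format() and f-strings can be checked with a small experiment. This is only a sketch: the replacement lambda and the variable names are illustrative, and the original builtin is restored at the end.

```python
import builtins

original_format = builtins.format
# Replace the builtin with a function that ignores its arguments.
builtins.format = lambda *args, **kwargs: "womp womp"

value = 1
fstring_error = None
try:
    # A plain call looks the name up in builtins, so it sees the replacement.
    via_builtin = format(value, "foo")
    try:
        # The f-string is formatted by the interpreter without that lookup.
        via_fstring = f"{value:foo}"
    except ValueError as exc:
        fstring_error = exc
finally:
    # Always put the real builtin back.
    builtins.format = original_format
```

Here via_builtin comes from the replacement, while the f-string still fails with the usual ValueError for the int value.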

Details

Currently, my API looks something like this (somewhat simplified):

import simpleformatter as sf
@sf.formatter("this_specification")
def this_formatting_function(some_obj):
    return "this formatted someobj!"

@sf.formatter("that_specification")
def that_formatting_function(some_obj):
    return "that formatted someobj!"

@sf.formattable
class SomeClass: ...

After which you can write code such as:

some_obj = SomeClass()
f"{some_obj:this_specification}"
f"{some_obj:that_specification}"
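To make the mechanism concrete, here is a rough sketch of how decorators like these could be built on top of __format__. It is not the actual simpleformatter implementation; the module-level _FORMATTERS registry and the fallback to the inherited __format__ are assumptions made for this example.

```python
_FORMATTERS = {}  # hypothetical registry: format spec -> formatting function


def formatter(spec):
    """Register the decorated function as the handler for one format spec."""
    def register(func):
        _FORMATTERS[spec] = func
        return func
    return register


def formattable(cls):
    """Class decorator that adds a __format__ dispatching on registered specs."""
    inherited_format = cls.__format__

    def __format__(self, format_spec):
        handler = _FORMATTERS.get(format_spec)
        if handler is not None:
            return handler(self)
        # Anything unregistered goes to the class's original __format__.
        return inherited_format(self, format_spec)

    cls.__format__ = __format__
    return cls


@formatter("this_specification")
def this_formatting_function(some_obj):
    return "this formatted someobj!"


@formattable
class SomeClass: ...


formatted = f"{SomeClass():this_specification}"
```

Applying formattable to int or str raises TypeError, because attributes of built-in types cannot be assigned. That is why this approach stops at user-defined classes.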

I would like the API to look more like this:

@sf.formatter("this_specification")
def this_formatting_function(some_obj):
    return "this formatted someobj!"

@sf.formatter("that_specification")
def that_formatting_function(some_obj):
    return "that formatted someobj!"

class SomeClass: ...  # no class decorator needed

...and to allow custom format specs to be used on built-in classes:

x = 1  # built-in type instance
f"{x:this_specification}"
f"{x:that_specification}"

However, in order to do these things, we have to burrow our way into the built-in format() function. How can I hook into that juicy f-string goodness?

Answer 1

Mic*_*lus 30

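Overview

You can, but only if you write evil code that should probably never end up in production software. So let's get started!

I'm not going to integrate it into your library, but I will show you how to hook into the behavior of f-strings. This is roughly how it works:

1. Write a function that manipulates the bytecode instructions of code objects to replace FORMAT_VALUE instructions with calls to a hook function;
2. Customize the import mechanism to make sure that the bytecode of every module and package (except standard library modules and site-packages) is modified with that function.

You can get the full source at https://github.com/mivdnber/formathack, but everything is explained below.

Disclaimer

This solution isn't great, because

1. There is no guarantee at all that this won't break totally unrelated code;
2. There is no guarantee that the bytecode manipulations described here will keep working in newer Python versions. It definitely won't work in alternative Python implementations that don't compile to CPython-compatible bytecode. PyPy could work in theory, but the solution described here doesn't, because the bytecode package isn't 100% compatible.

However, it is a solution, and bytecode manipulation has been used successfully in popular packages like PonyORM. Just keep in mind that it's hacky, complicated and probably maintenance-heavy.

Part 1: Bytecode manipulation

Python code isn't executed directly. It is first compiled to a simpler, intermediate, non-human-readable, stack-based language called Python bytecode (it's what's inside *.pyc files). To get an idea of what that bytecode looks like, you can use the standard library dis module to inspect the bytecode of a simple function: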

def invalid_format(x):
    return f"{x:foo}"

Calling this function results in an exception, but we'll "fix" that soon.

>>> invalid_format("bar")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in invalid_format
ValueError: Invalid format specifier

To inspect the bytecode, fire up a Python console and call dis.dis:

>>> import dis
>>> dis.dis(invalid_format)
  2           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 ('foo')
              4 FORMAT_VALUE             4 (with format)
              6 RETURN_VALUE

I've annotated the output below to explain what's happening:

# line 2      # Put the value of function parameter x on the stack
  2           0 LOAD_FAST                0 (x)
              # Put the format spec on the stack as a string
              2 LOAD_CONST               1 ('foo')
              # Pop both values from the stack and perform the actual formatting
              # This puts the formatted string on the stack
              4 FORMAT_VALUE             4 (with format)
              # pop the result from the stack and return it
              6 RETURN_VALUE
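
The listing above matches the CPython versions where the instruction is called FORMAT_VALUE. As far as I know, CPython 3.13 split it into FORMAT_SIMPLE and FORMAT_WITH_SPEC, so it is worth checking what your interpreter actually emits before rewriting anything. A small sketch using dis.get_instructions follows; the helper name formatting_opnames is made up for this example.

```python
import dis


def invalid_format(x):
    return f"{x:foo}"


def formatting_opnames(func):
    """Return the names of the formatting instructions found in func."""
    return [
        instruction.opname
        for instruction in dis.get_instructions(func)
        if instruction.opname.startswith("FORMAT_")
    ]


# The exact name depends on the interpreter version.
found = formatting_opnames(invalid_format)
```

If found contains something other than FORMAT_VALUE, the rewriting function described below needs to be adapted to that instruction set.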

The idea here is to replace the FORMAT_VALUE instruction with a call to a hook function, which lets us implement whatever behavior we want. For now, let's implement it like this:

def formathack_hook__(value, format_spec=None):
    """
    Gets called whenever a value is formatted. Right now it's a silly implementation,
    but it can be expanded with all sorts of nasty hacks.
    """
    return f"{value} formatted with {format_spec}"

To replace the instruction, I used the bytecode package, which has surprisingly nice abstractions for doing horrible things.

import types

from bytecode import Bytecode, Instr

def formathack_rewrite_bytecode__(code):
    """
    Modifies a code object to override the behavior of the FORMAT_VALUE
    instructions used by f-strings.
    """
    # Note: ROT_TWO, ROT_THREE and CALL_FUNCTION only exist up to CPython 3.10,
    # so this function targets the 3.6 to 3.10 instruction set.
    decompiled = Bytecode.from_code(code)
    modified_instructions = []
    for instruction in decompiled:
        name = getattr(instruction, 'name', None)
        if name == 'FORMAT_VALUE':
            # 0x04 means that a format spec is present
            if instruction.arg & 0x04 == 0x04:
                callback_arg_count = 2
            else:
                callback_arg_count = 1
            modified_instructions.extend([
                # Load in the callback
                Instr("LOAD_GLOBAL", "formathack_hook__"),
                # Shuffle around the top of the stack to put the arguments on top
                # of the function global
                Instr("ROT_THREE" if callback_arg_count == 2 else "ROT_TWO"),
                # Call the callback function instead of executing FORMAT_VALUE
                Instr("CALL_FUNCTION", callback_arg_count)
            ])
        # Kind of nasty: we want to recursively alter the code of functions.
        elif name == 'LOAD_CONST' and isinstance(instruction.arg, types.CodeType):
            modified_instructions.extend([
                Instr("LOAD_CONST", formathack_rewrite_bytecode__(instruction.arg), lineno=instruction.lineno)
            ])
        else:
            modified_instructions.append(instruction)
    modified_bytecode = Bytecode(modified_instructions)
    # For functions, copy over argument definitions
    modified_bytecode.argnames = decompiled.argnames
    modified_bytecode.argcount = decompiled.argcount
    modified_bytecode.name = decompiled.name
    return modified_bytecode.to_code()

We can now make the invalid_format function we defined earlier work:

>>> invalid_format.__code__ = formathack_rewrite_bytecode__(invalid_format.__code__)
>>> invalid_format("bar")
'bar formatted with foo'

Success! Manually cursing code objects with tainted bytecode won't by itself doom our souls to eternal suffering, though; for that, we should manipulate all code automatically.

Part 2: Hooking into the import process

To make the new f-string behavior work everywhere, and not just in manually patched functions, we can customize the Python module import process with a custom module finder and loader, using the functionality provided by the standard library importlib module. To make sure the Python interpreter uses this loader to import all files, it has to be added to sys.meta_path.

If we put all of this together in a formathack module (see https://github.com/mivdnber/formathack for an integrated working example), modules imported afterwards, other than standard library modules and site-packages, pick up the new f-string behavior.
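
The finder and loader themselves are not reproduced here. As a rough sketch of the idea, and not the code from the linked repository, a loader can subclass importlib.machinery.SourceFileLoader and pass every code object through a rewriting function, while a meta path finder restricts it to chosen directories. The names RewritingLoader, RewritingFinder, install, identity_rewrite and placeholder_hook are all hypothetical. In a real setup the two placeholders would be swapped for formathack_rewrite_bytecode__ and formathack_hook__. This sketch only handles top-level modules.

```python
import importlib.machinery
import sys


def identity_rewrite(code):
    """Placeholder for a bytecode rewriter; returns the code object unchanged."""
    return code


def placeholder_hook(value, format_spec=None):
    """Placeholder for the hook that rewritten code would call."""
    return format(value, format_spec or "")


class RewritingLoader(importlib.machinery.SourceFileLoader):
    """Source loader that passes each module's code through a rewriter."""

    rewrite = staticmethod(identity_rewrite)
    hook = staticmethod(placeholder_hook)

    def get_code(self, fullname):
        # exec_module() calls get_code() to obtain the code object to run.
        return self.rewrite(super().get_code(fullname))

    def exec_module(self, module):
        # Rewritten code looks the hook up as a module global, so inject it.
        module.__dict__["formathack_hook__"] = self.hook
        super().exec_module(module)


class RewritingFinder:
    """Meta path finder that only handles modules in the given directories."""

    def __init__(self, search_path):
        self.search_path = list(search_path)

    def find_spec(self, name, path=None, target=None):
        if path is not None:
            # Submodules of packages are left to the default finders here.
            return None
        spec = importlib.machinery.PathFinder.find_spec(name, self.search_path, target)
        if spec is None or not isinstance(
            spec.loader, importlib.machinery.SourceFileLoader
        ):
            return None
        spec.loader = RewritingLoader(name, spec.origin)
        return spec


def install(search_path):
    """Put a finder for search_path in front of the default import finders."""
    finder = RewritingFinder(search_path)
    sys.meta_path.insert(0, finder)
    return finder
```

Limiting the finder to an explicit list of directories is one way to keep standard library modules and site-packages untouched.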

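That's it! You can expand on this to make the hook function smarter and more useful (e.g. by registering functions that handle certain format specifiers).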
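
One possible shape for such a smarter hook is sketched below. It assumes a simple dictionary registry and falls back to the built-in format() for unregistered specs. The names _SPEC_HANDLERS and register_spec are invented for this example.

```python
_SPEC_HANDLERS = {}  # hypothetical registry: format spec -> handler function


def register_spec(spec):
    """Decorator that registers a handler for one custom format spec."""
    def decorator(func):
        _SPEC_HANDLERS[spec] = func
        return func
    return decorator


def formathack_hook__(value, format_spec=None):
    """Dispatch to a registered handler, else fall back to normal formatting."""
    handler = _SPEC_HANDLERS.get(format_spec)
    if handler is not None:
        return handler(value)
    # No custom handler: behave like a regular f-string field.
    return format(value, format_spec or "")


@register_spec("camcase")
def format_camel_case(string):
    """Camel-cases a sentence."""
    return "".join(s.capitalize() for s in string.split())
```

With this hook installed, standard specs such as .2f keep working, and only registered names like camcase get special treatment.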

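"It definitely won't work in alternative Python implementations like PyPy." Could you try it? PyPy appears to have the same bytecode *format*, at least at runtime; their JIT only works with the bytecode, it doesn't replace it. So there's a good chance this works in PyPy. (2 upvotes)
@MisterMiyagi Cool, I didn't know that! I just tested it with PyPy 7.3.5 (3.7.10) and it seems to fail because `dis.stack_effect` isn't available there. "Definitely won't work" was an overstatement though, so I'll edit the answer. (2 upvotes)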
