Disable special "class" attribute handling

ale*_*cxe 15 html python beautifulsoup html-parsing

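The Story:

When parsing HTML with BeautifulSoup, the class attribute is treated as a multi-valued attribute and is handled in a special manner:

Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes.

Also, here is a quote from the built-in HTMLTreeBuilder, which BeautifulSoup uses as a base for other tree builder classes such as HTMLParserTreeBuilder: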

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
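For reference, the default behaviour described above can be observed with a short check. This is a minimal sketch that assumes the stock html.parser builder; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the original question).
html = '<p class="foo bar">text</p>'
soup = BeautifulSoup(html, "html.parser")

# By default the class value is split on whitespace into a list.
class_value = soup.p["class"]

# Searching by a single CSS class matches any of the tag's classes.
found_by_foo = soup.find_all(class_="foo")

# On output the list is joined back into one string.
serialized = str(soup.p)

print(class_value, len(found_by_foo), serialized)
```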

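The Question:

How can I configure BeautifulSoup to handle class as a usual single-valued attribute? In other words, I don't want it to handle class specially; it should be treated as a regular attribute.

FYI, here is one of the use cases where this would be helpful:

What I've tried:

I've actually made it work by creating a custom tree builder class and removing class from the list of specially handled attributes: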

from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()

        # BeautifulSoup, please don't treat "class" specially
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())

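What I don't like about this approach is that it is quite "unnatural" and "magical": it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.

Note: I want to keep all the other HTML-parsing features, which means I don't want to parse the HTML with "xml"-only features (which could have been another workaround).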

dno*_*zay 6

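What I don't like about this approach is that it is quite "unnatural" and "magical": it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.

Yes, you can import it from bs4.builder instead: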

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
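A quick way to see what such a builder changes is to parse the same markup with and without it. The sketch below is an illustration rather than part of the answer: SingleValuedClassBuilder is a hypothetical variant of MyBuilder that filters "class" out of a per-instance copy of cdata_list_attributes instead of calling remove() in place. That choice rests on an assumption: the mapping appears to be shared at the class level in bs4, so an in-place remove() would also affect other HTML builders and would fail the second time the builder is instantiated.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class SingleValuedClassBuilder(HTMLParserTreeBuilder):
    """Hypothetical variant of MyBuilder that edits a private copy."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Rebuild the mapping on this instance, dropping "class" everywhere,
        # so the table used by other builders is left untouched
        # (assumption: that table is shared at the class level).
        self.cdata_list_attributes = {
            tag: type(attrs)(a for a in attrs if a != "class")
            for tag, attrs in self.cdata_list_attributes.items()
        }


# Illustrative markup (not from the original question).
html = '<p class="foo bar" id="x">text</p>'

custom = BeautifulSoup(html, builder=SingleValuedClassBuilder())
default = BeautifulSoup(html, "html.parser")

custom_class = custom.p["class"]    # kept as one plain string
default_class = default.p["class"]  # still split into a list

print(repr(custom_class), default_class)
```

Because the copy lives on the instance, creating the builder repeatedly is safe and the default html.parser behaviour stays as it was.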

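And if it matters enough that you don't want to repeat yourself, put the builder in its own module and register it with register_treebuilders_from() so that it takes precedence.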
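The registration step could look roughly like the sketch below. Several things here are assumptions made for illustration: the module name my_builders is hypothetical and is created in memory with types.ModuleType only to keep the example self-contained (in practice it would be a real file that lists the builder in __all__), and the builder filters a per-instance copy of the mapping because a registered builder is instantiated once per soup. Registration is process-wide, so every later BeautifulSoup(..., "html.parser") call in the same process would pick up this builder.

```python
import types

import bs4.builder
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Per-instance copy without "class"; safe to instantiate repeatedly.
        self.cdata_list_attributes = {
            tag: type(attrs)(a for a in attrs if a != "class")
            for tag, attrs in self.cdata_list_attributes.items()
        }


# Stand-in for a real module file such as my_builders.py (hypothetical name).
# register_treebuilders_from() reads the module's __all__ list.
my_builders = types.ModuleType("my_builders")
my_builders.MyBuilder = MyBuilder
my_builders.__all__ = ["MyBuilder"]

# Newly registered builders go to the front of the registry, so this one
# should win the lookup for the features it inherits ("html.parser", ...).
bs4.builder.register_treebuilders_from(my_builders)

# No builder= argument needed any more.
soup = BeautifulSoup('<p class="foo bar">text</p>', "html.parser")
registered_class = soup.p["class"]

print(type(soup.builder).__name__, repr(registered_class))
```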