ale*_*cxe 15 html python beautifulsoup html-parsing
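The story:
When parsing HTML with BeautifulSoup, the class attribute is considered a multi-valued attribute and is handled in a special way:
Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes.
Also, here is a comment from the built-in HTMLTreeBuilder, which BeautifulSoup uses as the base for its other tree builder classes such as HTMLParserTreeBuilder: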
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
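To make the default behaviour concrete, here is a minimal sketch. The markup and variable names are made up for illustration; the calls are ordinary BeautifulSoup usage with the html.parser builder.

```python
from bs4 import BeautifulSoup

# Illustrative markup; any tag with a space-separated class value would do.
markup = '<p class="body strikeout" id="first note">text</p>'
soup = BeautifulSoup(markup, "html.parser")
paragraph = soup.p

# "class" is on the multi-valued list, so it is split on whitespace.
class_value = paragraph["class"]

# "id" is not on that list, so it stays one string, spaces included.
id_value = paragraph["id"]

# Searching by one CSS class matches the tag even though it has two.
matches = soup.find_all("p", class_="strikeout")

# On output the list is joined back into the original string.
rendered = str(paragraph)
print(class_value, id_value, len(matches), rendered)
```

So class comes back as a list of two values while id stays a plain string, which is exactly the special handling the question wants to switch off for class.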
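The question:
How can I configure BeautifulSoup to handle class as a usual single-valued attribute? In other words, I don't want it to handle class specially; it should be treated as a regular attribute.
For reference, there is at least one use case where this would be helpful.
What I've tried:
I have actually made it work by creating a custom tree builder class and removing class from the list of specially handled attributes: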
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder


class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()

        # BeautifulSoup, please don't treat "class" specially
        self.cdata_list_attributes["*"].remove("class")


# data holds the HTML markup to parse
soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
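What I don't like about this approach is that it is quite "unnatural" and "magical", and it involves importing the "private" internal _htmlparser. I hope there is a simpler way.
Note: I want to keep all the other HTML-parsing features, which means I don't want to parse the HTML with the "xml"-only features (that could have been another workaround).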
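What I don't like about this approach is that it is quite "unnatural" and "magical", and it involves importing the "private" internal _htmlparser. I hope there is a simpler way.
Yes, you can import it from bs4.builder instead: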
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()

        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
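One caveat with removing "class" in place: as far as I can tell, the "*" entry is shared by all HTML builders, so the in-place remove() can leak into other soups and may fail the second time the builder is instantiated. Below is a hedged sketch of a variant that edits a private copy instead. SingleValuedClassBuilder and the sample markup are hypothetical names of mine, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class SingleValuedClassBuilder(HTMLParserTreeBuilder):
    """Hypothetical variant of MyBuilder that edits a private copy."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Copy the mapping so the defaults shared with other builders stay intact.
        own_attrs = {
            tag: set(names) for tag, names in self.cdata_list_attributes.items()
        }
        # Only "class" is dropped; every other multi-valued attribute is kept.
        own_attrs["*"].discard("class")
        self.cdata_list_attributes = own_attrs


markup = '<a class="btn primary" rel="nofollow noopener">x</a>'
custom = BeautifulSoup(markup, builder=SingleValuedClassBuilder())
default = BeautifulSoup(markup, "html.parser")

print(custom.a["class"])   # single string with the custom builder
print(custom.a["rel"])     # rel is still split into a list
print(default.a["class"])  # the stock builder is unaffected
```

If you are happy to switch off list handling for every attribute rather than just class, newer releases (4.8.0 and later, if I read the changelog correctly) document passing multi_valued_attributes=None to the BeautifulSoup constructor, which avoids the custom builder altogether.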
If it is important to you not to repeat yourself, put the builder in its own module and register it with register_treebuilders_from() so that it takes precedence.
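A rough single-file sketch of that registration idea follows. It calls builder_registry.register() directly, which, as far as I can tell, is what register_treebuilders_from() does for each TreeBuilder subclass listed in a module's __all__. PlainClassBuilder is a hypothetical name, and the copy-before-edit step is my own precaution rather than something bs4 requires.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder, builder_registry


class PlainClassBuilder(HTMLParserTreeBuilder):
    """Hypothetical builder that keeps "class" as a single string."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Work on a private copy of the multi-valued attribute mapping.
        own_attrs = {
            tag: set(names) for tag, names in self.cdata_list_attributes.items()
        }
        own_attrs["*"].discard("class")
        self.cdata_list_attributes = own_attrs


# Registering puts the class ahead of the stock builder for the features it
# inherits, including "html.parser". This is a process-wide change.
builder_registry.register(PlainClassBuilder)

# No builder= argument is needed any more; the feature lookup finds it.
soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")
class_value = soup.p["class"]
builder_name = type(soup.builder).__name__
print(builder_name, repr(class_value))
```

The trade-off is that every later BeautifulSoup(..., "html.parser") call in the same process picks up the custom builder, so this only makes sense when you want the behaviour everywhere.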