Don't automatically add html, head and body tags, beautifulsoup

Question


Ber*_*ire 29 python beautifulsoup html5lib

When using beautifulsoup with html5lib, it automatically adds the html, head and body tags:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>


Is there any option I can set to turn off this behavior?

Answer 1

unu*_*tbu 35

In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>


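This parses the HTML with Python's built-in HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.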
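To see the difference for yourself, you can run the same fragment through each parser. This is only a sketch: `render_with` is a hypothetical helper name, and since `lxml` and `html5lib` are optional third-party packages, the helper returns `None` when a parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with(parser, markup="<h1>FOO</h1>"):
    """Parse `markup` with the named parser and return the serialized tree.

    Returns None if the requested parser is not installed.
    (Hypothetical helper, not part of bs4.)
    """
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# Compare what each parser adds around the same fragment.
for name in ("html.parser", "lxml", "html5lib"):
    print(name, "->", render_with(name))
```

Only `html.parser` ships with Python itself, so it is the one result you can rely on being present.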

Alternatively, you could use the html5lib parser and just select the element after <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next
Out[62]: <h1>FOO</h1>


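@MartijnPieters: That doesn't seem to be true, at least as of version 4.1.3. If you don't specify `features`, it defaults to `['html', 'fast']`. Skimming the code, it seems `bs` uses the first builder listed in `bs.builder.builder_registry.builders_for_feature['html']`, which in my case is `bs4.builder._lxml.LXMLTreeBuilder`. So it seems to depend on what you have installed. Or rather, the default builder is whatever `bs.builder.builder_registry.lookup('html', 'fast')` returns. (4 upvotes)
Note that this answer actually breaks if there are multiple elements in the body. If you have `<h1>a</h1><h1>b</h1>`, only `<h1>a</h1>` will be returned. (2 upvotes)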
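To address the comment about multiple elements, one option is to serialize everything inside `<body>` rather than only its first child. A minimal sketch, assuming `inner_html` as a hypothetical helper name; it falls back to the whole soup when the chosen parser adds no `<body>`:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def inner_html(markup, parser="html5lib"):
    """Return the parsed fragment as a string, without html/head/body wrappers.

    Hypothetical helper: uses <body> when the parser created one,
    otherwise the top-level soup object.
    """
    soup = BeautifulSoup(markup, parser)
    root = soup.body if soup.body is not None else soup
    # decode_contents() serializes all children of the tag, not the tag itself.
    return root.decode_contents()


for parser in ("html.parser", "html5lib"):
    try:
        print(parser, "->", inner_html("<h1>a</h1><h1>b</h1>", parser))
    except FeatureNotFound:
        print(parser, "is not installed")
```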

Answer 2

ahu*_*igo 7

Let's first create a sample soup:

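from bs4 import BeautifulSoup
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")  # parser named explicitly; html5lib assumed, as in the question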


You can get the children of html and body by specifying soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)


You can also use unwrap() to remove body, head and html:

soup.html.body.unwrap()
# Test for <head> directly; recent bs4 releases reject a selector that starts with '>'
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
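The same idea can be wrapped in a small function that skips any wrapper that is missing, so it works regardless of which parser built the tree. This is a sketch only; `strip_document_wrappers` is a hypothetical name, and note that unwrapping `<head>` leaves its children (such as `<title>`) in the output.

```python
from bs4 import BeautifulSoup


def strip_document_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place if they exist.

    Hypothetical helper: unwrap() removes the tag itself but keeps
    its children where the tag used to be.
    """
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup


doc = "<html><head></head><body><p>content</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
print(strip_document_wrappers(soup))
```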


If you are loading an XML file, bs4.diagnose.diagnose(data) will tell you to use lxml-xml, which does not wrap your soup in html+body:

>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>


Answer 3

the*_*est 5

This aspect of BeautifulSoup has always annoyed me.

Here's how I deal with it:

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')

# Do stuff here

# Extract a string repr of the parse html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])


Quick breakdown:

# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children

# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)

# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]

# Join all the string objects together to recreate your original html
"".join()


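I still don't like this, but it gets the job done. I always run into this problem when I use BS4 to filter certain elements and/or attributes out of an HTML document before doing something else with them, and I need the whole thing back as a string repr rather than as a BS4-parsed object.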
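For that filter-then-serialize workflow, the pattern might be packaged roughly like this. It is a sketch under assumptions: `clean_fragment`, `drop_tags` and `drop_attrs` are hypothetical names, and `html.parser` is used as the default only so the example runs without extra packages.

```python
from bs4 import BeautifulSoup


def clean_fragment(html, drop_tags=("script",), drop_attrs=("style",),
                   parser="html.parser"):
    """Filter an HTML fragment and return it as a string without wrappers.

    Hypothetical helper illustrating the parse / modify / re-join pattern.
    """
    soup = BeautifulSoup(html, parser)

    # Remove unwanted elements entirely.
    for name in drop_tags:
        for tag in soup.find_all(name):
            tag.decompose()

    # Strip unwanted attributes from every remaining tag.
    for tag in soup.find_all(True):
        for attr in drop_attrs:
            tag.attrs.pop(attr, None)

    # Use <body> if the parser added one, otherwise the soup itself.
    root = soup.body if soup.body is not None else soup
    return "".join(str(child) for child in root.children)


print(clean_fragment('<p style="color:red">hi</p><script>x()</script>'))
```

Checking for `soup.body` keeps the join step working whether or not the chosen parser wraps the fragment.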

Hopefully the next time I google this, I'll find my answer here.

Archived:	12 years, 10 months ago
Views:	8,379
Last activity:	6 years, 2 months ago