BeautifulSoup innerhtml?

Question

Mat*_*nti 39 html python beautifulsoup innerhtml

Let's say I have a page with a div. I can easily get that div with soup.find().

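Now that I have the result, I'd like to print the whole innerhtml of that div: I mean, I need a string containing all the html tags and text together, exactly like the string I'd get in javascript with obj.innerHTML. Is this possible?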

Answer 1

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, the DOM's innerHTML method might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()

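As a rough usage sketch, the two methods can be compared on a div located with soup.find(). The sample markup and the `inner_html` helper name below are assumptions made for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = '<html><body><div id="content"><p>Hello <b>world</b></p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="content")


def inner_html(element):
    """Return the inner HTML of an element as a Python str."""
    return element.decode_contents()


inner_bytes = div.encode_contents()  # UTF-8 encoded bytes
inner_text = inner_html(div)         # Python (Unicode) str

print(inner_bytes)
print(inner_text)
```

Which of the two to prefer depends on whether the result is going to be written out as bytes or processed further as text.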

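These functions aren't currently in the online documentation, so I'll quote the current function definitions and the doc strings from the code.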

`encode_contents` - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

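See also the documentation for formatters; you'll most likely use either formatter="minimal" (the default) or formatter="html" (for html entities), unless you want to process the text manually in some way.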
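To make the difference between the two formatters concrete, here is a small sketch. The sample markup is an assumption chosen to contain both an ampersand and a non-ASCII character.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing an ampersand and a non-ASCII character (é).
soup = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")
p = soup.find("p")

# "minimal" escapes only what is needed to keep the output valid HTML.
minimal = p.decode_contents(formatter="minimal")

# "html" additionally converts characters to named HTML entities where it can.
as_entities = p.decode_contents(formatter="html")

print(minimal)
print(as_entities)
```

With the default formatter the é should come back unchanged, while formatter="html" is expected to emit it as a named entity; the ampersand stays escaped in both cases.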

encode_contents returns an encoded bytestring. If you want a Python Unicode string, use decode_contents instead.

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

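The relationship between the two methods can be checked with a short sketch. The markup is an assumed example, and the latin-1 call is only there to show the encoding parameter in use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a non-ASCII character (ï) so the encodings differ.
soup = BeautifulSoup("<div><span>na\u00efve</span></div>", "html.parser")
div = soup.find("div")

as_str = div.decode_contents()                       # Python str
as_utf8 = div.encode_contents()                      # bytes, UTF-8 by default
as_latin1 = div.encode_contents(encoding="latin-1")  # bytes in another encoding

# For a fragment like this one, encoding the str by hand gives the same bytes.
print(as_utf8 == as_str.encode("utf-8"))
print(as_latin1 == as_str.encode("latin-1"))
```

If the string is going to be manipulated further in Python, decode_contents() avoids a needless encode/decode round trip.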

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead it has renderContents

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.
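For code that may receive either BS4 or BS3-style elements, one possible approach is a small wrapper that prefers decode_contents and falls back to renderContents. This is only a sketch: the `inner_html` name is hypothetical, and the fallback path assumes renderContents returns UTF-8 bytes by default, as its docstring above suggests.

```python
from bs4 import BeautifulSoup


def inner_html(element):
    """Return inner HTML as a str for a BS4 element or a BS3-style element."""
    # Use getattr + callable rather than hasattr: Tag objects can treat unknown
    # attribute names as child-tag lookups, so hasattr alone may be misleading.
    decode = getattr(element, "decode_contents", None)
    if callable(decode):
        return decode()
    # Fallback for elements that only offer the BS3-era method.
    rendered = element.renderContents()
    return rendered.decode("utf-8") if isinstance(rendered, bytes) else rendered


# Hypothetical sample markup, parsed with BS4.
div = BeautifulSoup("<div><b>x</b> y</div>", "html.parser").find("div")
print(inner_html(div))
```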

Does anyone know why this is undocumented? It seems like it would be a common use case. (2 upvotes)

Answer 2

Pik*_*er2 17

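Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are some methods and attributes that can be used to retrieve its HTML and text in different ways, along with an example of what each one returns.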

InnerHTML:

inner_html = element.encode_contents()  # bytes on Python 3; use decode_contents() for a str

b'<div id="inner">foobar</div>'

OuterHTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

OuterHTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'

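The snippets above can be put together into one runnable sketch. The second element (`mixed`) is an assumption added for illustration, to show a case where .text and .string differ: .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><div id="inner">foobar</div></div>', "html.parser")
element = soup.find("div", id="outer")

print(element.decode_contents())  # inner HTML as a str
print(str(element))               # outer HTML
print(element.text)               # concatenated text of all descendants
print(element.string)             # the single string found by following lone children

# Hypothetical element with mixed children: a text node and a <b> tag.
mixed = BeautifulSoup("<p>foo<b>bar</b></p>", "html.parser").find("p")
print(mixed.text)    # text of all descendants joined together
print(mixed.string)  # None, because <p> has more than one child
```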

Answer 3

pee*_*why 11

One option could be to use something like this:

 innerhtml = "".join([str(x) for x in div_element.contents])

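This has a few other problems. First, it doesn't escape html entities (such as greater-than and less-than) in string elements. Second, it writes out the content of comments, but not the comment tags themselves. (2 upvotes)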
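The two problems described in the comment can be reproduced with a short sketch. The markup is an assumed example containing an escaped less-than sign and an HTML comment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an escaped "<" and an HTML comment.
soup = BeautifulSoup('<div id="d">1 &lt; 2<!-- note --><b>x</b></div>', "html.parser")
div_element = soup.find("div")

# The join-based approach from this answer.
joined = "".join(str(x) for x in div_element.contents)

# The decode_contents() approach, shown for comparison.
decoded = div_element.decode_contents()

print(joined)   # the "<" is left unescaped and the comment markers are missing
print(decoded)  # the "<" is escaped again and the comment keeps its markers
```

For a div that contains only tags and plain text without special characters, both approaches should give the same string.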
