Beautiful Soup - get all text, but preserve link HTML?

Question

waf*_*ffl 6 html python parsing beautifulsoup

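I have to convert a large archive of extremely messy HTML into Markdown. The HTML is full of extraneous tables, spans, and inline styles.

I am trying to use Beautiful Soup for this task. What I want is essentially the output of the get_text() function, except that anchor tags are kept, with their href intact.

As an example, I would like to convert: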

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>


into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>


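My thinking so far has been to simply grab all the tags and unwrap every one that is not an anchor. This causes the text to be repeated many times, because soup.find_all(True) returns recursively nested tags as separate elements: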

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())


As the loop moves down the tree, it returns multiple fragments and duplicates:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>

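
The repetition can be traced to how get_text() works: every text node is part of the get_text() output of each of its ancestors, so a loop over soup.find_all(True) emits it once per enclosing tag. The sketch below is only illustrative. The shortened markup, the ancestors_of helper, and the use of the built-in html.parser instead of lxml are assumptions made to keep it self-contained.

```python
from bs4 import BeautifulSoup

# Shortened version of the example markup (an assumption, for brevity).
html = '<td><span>Foo Bar <span>Baz</span></span></td>'
soup = BeautifulSoup(html, "html.parser")


def ancestors_of(text_node):
    """Return the names of the tags enclosing a text node (hypothetical helper)."""
    return [
        parent.name
        for parent in text_node.parents
        if not isinstance(parent, BeautifulSoup)  # skip the document root
    ]


# Each text node appears in get_text() of every tag listed next to it.
for text_node in soup.strings:
    print(repr(str(text_node)), "is inside", ancestors_of(text_node))

# Names of the tags whose get_text() output contains "Baz".
tags_with_baz = [tag.name for tag in soup.find_all(True) if "Baz" in tag.get_text()]
print(tags_with_baz)
```

Since "Baz" sits inside three tags here, printing get_text() for every tag prints it three times. Visiting each text node exactly once avoids this.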

Answer 1

ale*_*cxe 6

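One possible way to solve this is to give <a> elements some special handling when producing an element's text.

You can do this by overriding the _all_strings() method so that it returns the string representation of any <a> descendant element and skips the navigable strings inside <a> elements. Something along these lines: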

from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        # Some newer bs4 releases pass a sentinel object (not a tuple) as the
        # default for "types"; fall back to the original default in that case
        # so the membership test below does not raise a TypeError.
        if types is not None and not isinstance(types, (tuple, list, set, frozenset)):
            types = (NavigableString, CData)

        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)
                continue

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant


Demo:

In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """

In [2]: soup = MyBeautifulSoup(data, "lxml")

In [3]: print(soup.get_text())

HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>


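I'm not sure whether the version has changed, but running the sample code gives me an error: File "/Users/xhuang9/pros/station1/t1.py", line 20, in _all_strings (types is not None and type(descendant) not in types)): TypeError: argument of type 'object' is not iterable (2 upvotes)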
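
Because _all_strings() is a private method whose signature can differ between bs4 releases, an alternative that uses only public calls may be less fragile. The following is a minimal sketch, not the answer author's method: the function name text_keeping_links and the choice of html.parser are assumptions. It strips every attribute except href from <a> tags, turns <br> into a newline, and unwraps all other tags in place.

```python
from bs4 import BeautifulSoup


def text_keeping_links(html, parser="html.parser"):
    """Flatten HTML to text, keeping <a> tags with only their href (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)

    # Drop everything but href from anchors (target, style, ...).
    for a in soup.find_all("a"):
        href = a.get("href")
        a.attrs = {"href": href} if href is not None else {}

    # Treat <br> as a line break.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Replace every remaining non-anchor tag with its own children.
    for tag in soup.find_all(True):
        if tag.name != "a":
            tag.unwrap()

    return str(soup)


example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank" '
    'style="color: #395c99;">Google</a></span></td>'
)
print(text_keeping_links(example_html))
```

Note that adjacent inline elements are joined without a separator, so the first line comes out as HelloWorld rather than Hello World. Because the result is serialized with str(), special characters in the text stay HTML-escaped.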
