Using Python and BeautifulSoup, select only text nodes that are not inside an <a> tag

Question

I'm trying to parse some text so that I can urlize (wrap in <a> tags) the links that aren't formatted yet. Here is some sample text:

text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'

Here is what I have so far:

from django.utils.html import urlize
from bs4 import BeautifulSoup

...

def urlize_html(text):

    soup = BeautifulSoup(text, "html.parser")

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    return str(soup)

But this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:

<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank">&lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</a>, and this is a link too but not formatted: &lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</p>

What can I do so that textNodes = soup.findAll(text=True) only includes text nodes that aren't already wrapped in an <a> tag?

Answer 1

Mar*_*ers 5

Text nodes keep a reference to their parent, so you only need to test whether that parent is an a tag:

for textNode in textNodes:
    if textNode.parent and getattr(textNode.parent, 'name') == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
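
For reference, here is a self-contained sketch of the same idea that runs without Django. It makes a few assumptions that are not in the original posts: URL_RE and link_plain_urls are hypothetical names, and the regex is only a rough stand-in for django.utils.html.urlize (it does not strip trailing punctuation, for example). It uses find_parent("a") rather than a direct parent check so that text nested deeper inside a link, such as <a><b>text</b></a>, is skipped too, and it builds real <a> tags with new_tag instead of inserting a markup string, because a string passed to replaceWith is treated as text and escaped on output.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Rough stand-in for Django's urlize (an assumption for this sketch only).
URL_RE = re.compile(r'https?://[^\s<>"]+')

def link_plain_urls(html):
    """Wrap bare URLs in <a> tags, leaving text already inside a link alone."""
    soup = BeautifulSoup(html, "html.parser")
    # Copy the result into a list first, because the loop modifies the tree.
    for node in list(soup.find_all(string=True)):
        # Skip text that has an <a> anywhere among its ancestors.
        if node.find_parent("a") is not None:
            continue
        text = str(node)
        matches = list(URL_RE.finditer(text))
        if not matches:
            continue
        pos = 0
        for m in matches:
            before = text[pos:m.start()]
            if before:
                node.insert_before(NavigableString(before))
            link = soup.new_tag("a", href=m.group(0))
            link.string = m.group(0)
            node.insert_before(link)
            pos = m.end()
        tail = text[pos:]
        if tail:
            node.insert_before(NavigableString(tail))
        # The original text node has been replaced piece by piece; drop it.
        node.extract()
    return str(soup)

text = ('<p>This is a <a href="https://google.com">link</a>, this is also a link '
        'where the text is the same as the link: '
        '<a href="https://google.com">https://google.com</a>, and this is a link '
        'too but not formatted: https://google.com</p>')
print(link_plain_urls(text))
```

With the sample text from the question, only the last, unformatted URL should gain a new <a> tag; the two existing links are left as they were.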
