Using Python and BeautifulSoup, select only text nodes that are not inside an <a>

43T*_*cts 2 python beautifulsoup

I'm trying to parse some text so that I can urlize (wrap in <a> tags) the links that aren't already formatted. Here is some sample text:

text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'

Here is what I have so far:

from django.utils.html import urlize
from bs4 import BeautifulSoup

...

def urlize_html(text):

    soup = BeautifulSoup(text, "html.parser")

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    return str(soup)

But this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:

<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank">&lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</a>, and this is a link too but not formatted: &lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</p>

What can I do so that textNodes = soup.findAll(text=True) only includes text nodes that aren't already wrapped in an <a> tag?

Mar*_*ers 5

Text nodes keep a reference to their parent, so you can simply test whether that parent is an a tag:

for textNode in textNodes:
    # A text node's parent is the tag that directly contains it.
    if textNode.parent and getattr(textNode.parent, 'name', None) == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
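
For reference, the parent test can be put together into a self-contained sketch. To keep it runnable without Django, it does not call urlize: the URL_RE pattern below is a simplified, hypothetical stand-in for it. It also builds the <a> tags with soup.new_tag rather than replacing the node with a string, since BeautifulSoup escapes markup in a replacement string (the &lt;a ...&gt; visible in the question's output). The names URL_RE, urlize_html and sample are illustrative assumptions only.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Simplified stand-in for django.utils.html.urlize (an assumption, not
# Django's behaviour): matches bare http(s) URLs and leaves trailing
# punctuation such as "," or "." outside the match.
URL_RE = re.compile(r"""https?://[^\s<>"']+[^\s<>"'.,;:!?]""")


def urlize_html(html):
    soup = BeautifulSoup(html, "html.parser")

    # Copy the result into a list first, because the tree is modified
    # inside the loop.
    for text_node in list(soup.find_all(string=True)):
        parent = text_node.parent
        if parent is not None and parent.name == "a":
            continue  # already inside a link, leave it alone

        text = str(text_node)
        pos = 0
        for match in URL_RE.finditer(text):
            # Plain text that precedes the URL.
            if match.start() > pos:
                text_node.insert_before(NavigableString(text[pos:match.start()]))
            # A real <a> tag, so it is serialised as markup, not escaped.
            link = soup.new_tag("a", href=match.group(0))
            link.string = match.group(0)
            text_node.insert_before(link)
            pos = match.end()

        if pos == 0:
            continue  # no URL in this text node
        # Plain text that follows the last URL.
        if pos < len(text):
            text_node.insert_before(NavigableString(text[pos:]))
        text_node.extract()  # drop the original, now duplicated, text node

    return str(soup)


sample = (
    '<p>This is a <a href="https://google.com">link</a>, this is also a link '
    'where the text is the same as the link: '
    '<a href="https://google.com">https://google.com</a>, and this is a link '
    'too but not formatted: https://google.com</p>'
)
print(urlize_html(sample))
```

With the sample text from the question, only the last, unformatted URL should gain a new <a> tag. Note that the parent test only looks at the direct parent: for text nested deeper inside a link, such as <a><b>...</b></a>, checking text_node.find_parent("a") instead would probably be the safer test.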