I have a Python method (thanks to this snippet) that uses BeautifulSoup and Django's urlize to take some HTML and wrap <a> tags around any unformatted links:
from django.utils.html import urlize
from bs4 import BeautifulSoup

def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")
    print(soup)
    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if textNode.parent and getattr(textNode.parent, 'name') == 'a':
            continue  # skip already formatted links
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)
    print(soup)
    return str(soup)
The sample input text (as printed by the first print statement) looks like this:
this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should become formatted: http://google.ca
The resulting returned text (as printed by the second print statement) looks like this:
this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should …
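The output above is cut off, so what follows is only a guess at the symptom: if the newly added tags come out escaped (`&lt;a href=...`), that is what BeautifulSoup does when a text node is replaced with a plain string, since the string is treated as text rather than markup. A minimal sketch of one possible workaround is to parse the generated markup first and insert the resulting nodes. Here `linkify` and `URL_RE` are hypothetical stand-ins for `urlize` so the example runs without Django, and `html_urlize` is written as a plain function rather than a method.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for django.utils.html.urlize (assumption:
# it only needs to wrap bare http(s) URLs for the purpose of this sketch).
URL_RE = re.compile(r'(https?://[^\s<]+)')

def linkify(text):
    return URL_RE.sub(r'<a href="\1">\1</a>', text)

def html_urlize(text):
    soup = BeautifulSoup(text, "html.parser")
    # Copy the node list first, because the loop modifies the tree.
    for text_node in list(soup.find_all(string=True)):
        if text_node.parent is not None and text_node.parent.name == "a":
            continue  # skip text that is already inside a link
        original = str(text_node)
        linked = linkify(original)
        if linked == original:
            continue  # nothing to wrap in this node
        # Parse the generated markup so it is inserted as tags, not escaped text.
        fragment = BeautifulSoup(linked, "html.parser")
        for child in list(fragment.contents):
            text_node.insert_before(child)
        text_node.extract()  # drop the original plain-text node
    return str(soup)
```

In the original method the same idea would mean passing `urlize(textNode)` through `BeautifulSoup(..., "html.parser")` before inserting it, instead of handing the raw string to `replaceWith`.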