Filtering out empty <span> tags from HTML code

Question

I have some HTML code containing many lines that I want to remove. They look like this:

<span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>

There are also span tags with text between them, and I want to keep those.

I want to use Python's re.sub function to remove the useless span tags. I wrote this, but it did not work:

html_code_filtered = re.sub('<span*></span>', '', html_code)

I think I am missing something in the regex to match those lines correctly?
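
A likely reason the pattern fails: in '<span*>' the * applies only to the preceding n, so the pattern matches "<spa", any number of "n" characters, and then ">". It never consumes the style attribute. A minimal sketch of a corrected pattern follows, assuming that attribute values never contain ">" and that a span counts as empty when it holds nothing or only whitespace. The name EMPTY_SPAN and the shortened sample input are illustrative, not from the question.

```python
import re

# Shortened sample input; the real html_code would be the full document.
html_code = (
    '<span style="position:absolute; left:94px; top:600px;"></span>\n'
    '<span>useful text</span>\n'
    '<span></span>\n'
)

# \b      ensures the tag name is exactly "span"
# [^>]*   consumes any attributes inside the opening tag
# \s*     lets whitespace-only spans count as empty (an assumption)
EMPTY_SPAN = re.compile(r'<span\b[^>]*>\s*</span>')

html_code_filtered = EMPTY_SPAN.sub('', html_code)
print(html_code_filtered)
```

This only handles simple, well-formed markup; for arbitrary HTML a parser is usually the safer choice.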

Answer 1

ale*_*cxe 6

You can use the HTML parser BeautifulSoup to remove the span elements that have no text.

Working example:

from bs4 import BeautifulSoup

data = """
<div>
    <span style="position:absolute; border: black 1px solid; left:94px; top:600px; width:6px; height:10px;"></span>
    <span>useful text</span>
    <span></span>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# find and remove "span" elements with empty contents
for useless in soup.find_all("span", text=lambda text: not text):
    useless.extract()

print(soup.prettify())

Prints (as you can see, the span elements with empty contents were removed):

<div>
 <span>
  useful text
 </span>
</div>

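The filter in the example tests each span's string as-is, so a span that holds only whitespace would presumably be kept. A variant sketch follows, assuming whitespace-only spans should also be removed and spans that wrap other elements (such as an image) should be kept. The sample markup and the variable names has_child_tag, has_text, and remaining are illustrative.

```python
from bs4 import BeautifulSoup

data = """
<div>
    <span style="position:absolute; left:94px; top:600px;"></span>
    <span>useful text</span>
    <span>   </span>
    <span><img src="x.png"></span>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# find_all returns a list, so removing elements while looping is safe
for span in soup.find_all("span"):
    has_child_tag = span.find(True) is not None      # any nested element
    has_text = bool(span.get_text(strip=True))       # non-whitespace text
    if not has_child_tag and not has_text:
        span.decompose()  # remove the element from the tree and discard it

remaining = soup.find_all("span")
print(len(remaining))
```

Here decompose() is used instead of extract() because the removed elements are not needed afterwards; extract() would work equally well for this purpose.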
