My question is somewhat related to: Strip HTML from strings in Python
I am looking for a simple way to remove HTML code from text. For example:
string = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(string)
This would then yield foo bar.
Is there a simple tool to achieve this in Python? The HTML code could be nested.
import lxml.html
import re

def stripIt(s):
    doc = lxml.html.fromstring(s)    # parse html string
    txt = doc.xpath('text()')        # ['foo ', ' bar']
    txt = ' '.join(txt)              # 'foo   bar'
    return re.sub(r'\s+', ' ', txt)  # 'foo bar'

s = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(s)
returns
foo bar
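If installing lxml is not an option, roughly the same behaviour can be sketched with the standard library's html.parser module. This is only a minimal sketch, not part of the answer above: the names strip_tagged_content, _TopLevelText and VOID_TAGS are hypothetical, and the code assumes the input is reasonably well-formed (every non-void opening tag has a matching closing tag). It keeps only the text that lies outside every element, so nested tags are dropped along with their contents.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class _TopLevelText(HTMLParser):
    """Collects only the text that sits outside every element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of elements currently open
        self.chunks = []  # text seen while no element is open

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            # Guard against stray closing tags so depth never goes negative.
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def strip_tagged_content(s):
    """Return the text of s with all tags and their contents removed."""
    parser = _TopLevelText()
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r'\s+', ' ', ' '.join(parser.chunks)).strip()


print(strip_tagged_content(
    'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'))
```

As in the lxml version, the text pieces are joined with a space, so a space can appear where a tag stood between two words that originally touched.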
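from bs4 import BeautifulSoup  # BeautifulSoup 4; the original used the legacy BeautifulSoup 3 import

def removeTags(html, *tags):
    soup = BeautifulSoup(html, 'html.parser')
    for name in tags:
        # remove every matching tag together with its contents
        for element in soup.find_all(name):
            element.extract()
    return soup
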
testhtml = '''
<html>
<head>
<title>Page title</title>
</head>
<body>text here<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>'''

print(removeTags(testhtml, 'b', 'p'))