I am trying to remove all the divs.
Input:
<p>111</p>
<div class="1334">bla</div>
<p>333</p>
<p>333</p>
<div some unkown stuff>bla2</div>
Expected output:
<p>111</p>
<p>333</p>
<p>333</p>
I tried this, but it doesn't work:
release_content = re.sub("/<div>.*<\/div>/s", "", release_content)
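A small check may help show why the call above leaves the input untouched. Python's `re` module has no `/pattern/flags` syntax, so the leading `/` and the trailing `/s` are matched as literal characters, and a bare `<div>` cannot match a tag that carries attributes. The sketch below uses a shortened version of the input; the variable names and the "Python-style" pattern are illustrative assumptions, not part of the original post. It also shows that even a corrected pattern misbehaves on nested divs.

```python
import re

sample = '<p>111</p>\n<div class="1334">bla</div>\n<p>333</p>'

# The pattern from the question: both slashes and the final "s" are literal text.
php_style = r"/<div>.*<\/div>/s"
unchanged = re.sub(php_style, "", sample)

# A pattern written for Python: attributes allowed, DOTALL passed as a flag.
python_style = re.compile(r"<div\b[^>]*>.*?</div>", re.DOTALL)
flat_result = python_style.sub("", sample)

# On nested divs the non-greedy match stops at the first closing tag.
nested = "<div><div>a</div>b</div><p>x</p>"
nested_result = python_style.sub("", nested)

print(unchanged == sample)   # the original pattern matched nothing
print(repr(flat_result))
print(repr(nested_result))   # leftover text and a stray closing tag
```

The nested case is the usual reason a parser is the safer choice here: a regular expression has no notion of which closing tag belongs to which opening tag.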
Don't use regular expressions for this. Use an HTML parser. Here is a Python solution using BeautifulSoup:
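# BeautifulSoup 4 (package "beautifulsoup4"); the older
# `from BeautifulSoup import BeautifulSoup` is version 3, which is Python 2 only.
from bs4 import BeautifulSoup

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')

# Detach every <div>, together with its contents, from the tree.
for div in soup.find_all('div'):
    div.extract()

with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))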
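If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's `html.parser`. This is not the approach from the answer above, only a rough alternative: `DivStripper` and `strip_divs` are hypothetical names, comments and doctype declarations are dropped, and the depth counter assumes every `<div>` is properly closed.

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit the input markup, skipping each <div> and everything inside it."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of <div> elements currently open
        self.parts = []   # pieces of markup that are kept

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if tag != "div" and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_divs(html):
    """Return html with all <div> elements removed."""
    parser = DivStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample_html = (
    '<p>111</p>\n'
    '<div class="1334">bla</div>\n'
    '<p>333</p>\n'
    '<p>333</p>\n'
    '<div some unkown stuff>bla2</div>'
)
cleaned = strip_divs(sample_html)
kept_lines = [line for line in cleaned.splitlines() if line.strip()]
print("\n".join(kept_lines))
```

The line breaks that surrounded each removed div stay in `cleaned`, which is why the usage lines filter out blank lines before printing.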