Roa*_*ame 1 python beautifulsoup python-2.7
How do I remove a "table" from HTML using Python?
I have a case like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<table>
<tr>
<td>
text title
</td>
<td>
text title 2
</td>
</tr>
</table>
<p> lorem ipsum</p>
'''
How can I use Python to remove the table structure shown above? I would like the output to look like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<p> lorem ipsum</p>
'''
You can use BeautifulSoup, specifically PageElement.extract(), which removes an element from the tree and returns it:
In [16]: from bs4 import BeautifulSoup
In [17]: soup = BeautifulSoup("""<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
....: <table>
....: <tr>
....: <td>
....: text title or some
....: </td>
....: </tr>
....: </table>
....: <p> lorem ipsum</p>""")
In [18]: _ = soup.table.extract()
In [19]: soup
Out[19]:
<html><body><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br/><br/>
</p>
<p> lorem ipsum</p></body></html>
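The session above removes only the first table, and the parser it picked wraps the result in <html><body>. If you want every table gone and no wrapper added, something like the following may work. This is only a sketch: remove_tables is a hypothetical helper name, and it assumes the built-in "html.parser" backend, which does not add <html> or <body> tags but also does not repair the unclosed <p>.

```python
from bs4 import BeautifulSoup


def remove_tables(html):
    """Return html with every <table> element (and its contents) removed.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    # "html.parser" is an assumption; it leaves the fragment unwrapped.
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting while looping is safe.
    for table in soup.find_all("table"):
        table.extract()
    return str(soup)


paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit.<br /><br />
<table>
<tr>
<td>
text title
</td>
<td>
text title 2
</td>
</tr>
</table>
<p> lorem ipsum</p>
'''

cleaned = remove_tables(paragraph)
print(cleaned)
```

Because the first <p> is never closed in the input, the exact whitespace and closing tags in the result depend on the parser, so check the output against your own markup.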