Iva*_*Sas 8 python screen-scraping beautifulsoup
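I want to find all tables in an HTML document using BeautifulSoup. Inner tables should be included in their outer tables.
I wrote some code that works and gives the expected output. However, I don't like this solution because it destroys the 'soup' object.
Do you know how to do this in a more elegant way?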
from BeautifulSoup import BeautifulSoup as bs
input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''
soup = bs(input)
while(True):
    t=soup.find("table")
    if t is None:
        break
    print str(t)
    t.decompose()
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
Wea*_*Fox 17
Use soup.findAll("table") instead of find() and decompose(), and print only the tables that have no parent table:
tables = soup.findAll("table")
for table in tables:
    if table.findParent("table") is None:
        print str(table)
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
Nothing gets destroyed; the soup object is left intact.
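For anyone on Python 3, here is a rough port of the same idea to the bs4 package, where the methods are spelled find_all() and find_parent(). This is only a sketch under a few assumptions: bs4 is installed, the names html_doc and outer_tables are mine, and I picked the built-in "html.parser" because it keeps the nested tables as written. Other parsers such as lxml or html5lib may repair this invalid nesting differently, so the result could change with them.

```python
# Sketch for Python 3 with the bs4 package (pip install beautifulsoup4).
from bs4 import BeautifulSoup

# Same markup as in the question; html_doc is a name chosen for this sketch.
html_doc = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''

# Assumption: "html.parser" is used because it leaves the nesting as written.
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the tables that have no <table> ancestor, i.e. the outermost ones.
outer_tables = [
    table
    for table in soup.find_all("table")
    if table.find_parent("table") is None
]

for table in outer_tables:
    print(table)

# The tree was only read, never modified, so every table is still in the soup.
print(len(soup.find_all("table")))
```

Since nothing is decomposed, the soup can be reused afterwards, and each inner table is still reachable from its outer table.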