I need to do some HTML parsing with Python. Suppose I have an HTML file like the one below:
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
How can I get the content of the <div class="mydiv"> element? That is, I want this:
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
I have tried HTMLParser, but I don't think it can do this. Is there another way? Thanks!
With BeautifulSoup, it is simple:
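# Requires the beautifulsoup4 package (pip install beautifulsoup4)
from bs4 import BeautifulSoup
html = """
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag; take the first div with class "mydiv"
result = soup.find_all('div', {'class': 'mydiv'})
tag = result[0]
# .contents lists the tag's direct children (tags and whitespace strings)
print(tag.contents)
# Output:
# ['\n', <p>i want got it</p>, '\n', <div>
# <p> good </p>
# <a> boy </a>
# </div>, '\n']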
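For what it's worth, the standard-library HTMLParser mentioned in the question can probably do this too, if you track how deeply nested div tags are while parsing. The following is only a minimal sketch under that assumption, written for Python 3: the class name DivContentExtractor and its attributes are made up for illustration and are not part of any library, and it assumes reasonably well-formed markup with properly closed div tags.

```python
from html.parser import HTMLParser


class DivContentExtractor(HTMLParser):
    """Hypothetical helper: collects the markup inside the first <div>
    whose class attribute contains the given class name."""

    def __init__(self, class_name):
        # Keep entities such as &amp; unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.class_name = class_name
        self.depth = 0          # how many <div> tags are open inside the target
        self.capturing = False  # True while we are inside the target div
        self.done = False       # True once the target div has been closed
        self.parts = []         # markup fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.capturing:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.capturing = True
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self.capturing:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.capturing:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # This is the closing tag of the target div itself: stop here.
                self.capturing = False
                self.done = True
                return
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.capturing:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.capturing:
            self.parts.append("&#%s;" % name)


html = """
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
"""

parser = DivContentExtractor("mydiv")
parser.feed(html)
parser.close()
inner_html = "".join(parser.parts)
print(inner_html)
```

Compared with BeautifulSoup this is more code and less forgiving of broken HTML, so it mainly makes sense when you cannot install third-party packages.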