Navigating the HTML DOM in Python

Jak*_*est 4 html python dom httprequest

I'm looking for a Python script (using 3.4.3) that fetches an HTML page from a URL and can walk the DOM to find a specific element.

This is what I have so far:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

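When I print the content, it prints out the entire HTML page, which is close to what I want... but I'd like to be able to navigate the DOM rather than treat it as one giant string.

I'm still new to Python, but I have experience with several other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar in Java before, but wanted to try it in Python.

Any help is appreciated. Cheers!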

Zac*_*tes 8

There are many different modules you can use, for example lxml or BeautifulSoup.

Here is an lxml example:

import urllib.request

import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

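A BeautifulSoup example:

import urllib.request

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup_mysite = BeautifulSoup(mysite, "html.parser")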

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

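Note that BeautifulSoup returns a unicode string here, while lxml does not. Depending on your needs, this can be helpful or harmful. (The `u"..."` prefix in the output is Python 2 behavior; on Python 3 both libraries return `str`.)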
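
If installing a third-party package is not an option, a similar lookup can be approximated with the standard library alone. The following is a minimal sketch and not part of the original answer: it assumes Python 3 and uses `html.parser.HTMLParser`. The names `MetaDescriptionParser` and `find_meta_description` are hypothetical, and the sample markup is made up for illustration. It expects a `str`, so the bytes returned by `urlopen(...).read()` would need to be decoded first.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Records the content attribute of the first <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased from the parser.
        if tag == "meta" and self.description is None:
            attributes = dict(attrs)
            if attributes.get("name") == "description":
                self.description = attributes.get("content")


def find_meta_description(html_text):
    """Return the description meta content, or None if it is absent."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.description


# Made-up sample document, used instead of a live request.
sample = (
    "<html><head>"
    "<meta name='description' content='A sample page'>"
    "</head><body><p>Hello</p></body></html>"
)
print(find_meta_description(sample))
```

This approach is event-based rather than a real DOM: it reacts to tags as they are parsed and keeps no tree, so for parent/child navigation or XPath queries lxml or BeautifulSoup remain the more convenient choice.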

  • @Shatu: In general, modules like `BeautifulSoup` and `lxml` are better in terms of performance. (2 upvotes)