How do I fetch an HTML file using Python?

Question

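I'm fairly new to Python. I'm trying to extract the artist names (to start with :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html

How do I retrieve the page? My two main problems are which functions to use and how to filter out the useless links from the page.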

Answer 1

An example using urllib and lxml.html:

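# Python 2 code: urllib.urlopen and the print statement do not exist in Python 3.
import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
# Download the page and parse it into an lxml element tree
page = html.fromstring(urllib.urlopen(url).read())

# Print the text and target of every <a> element on the page
for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")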

Links found on the page, abridged and shown here as (name, href) pairs:
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]


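In Python 3, you should import urllib.request and use the urllib.request.urlopen function. See http://docs.python.org/3.2/library/urllib.request.html#urllib.request.urlopen (8 upvotes)
urllib is outdated these days; you should use the requests library or something else that handles modern concerns. (2 upvotes)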
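
Following the Python 3 comment above, here is a minimal sketch of the same idea for Python 3. It is an illustration under assumptions, not the answerer's code: it swaps lxml for the standard library's html.parser so that it runs without extra packages, the names LinkCollector, parse_links and fetch_links are made up for this example, and decoding as UTF-8 with replacement is a guess about the page's encoding.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def parse_links(html_text):
    """Return a list of (text, href) pairs found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


def fetch_links(url):
    """Download url and return its links (hypothetical helper)."""
    with urlopen(url) as response:
        # Assumption: the page decodes as UTF-8; undecodable bytes are replaced.
        html_text = response.read().decode("utf-8", errors="replace")
    return parse_links(html_text)
```

Calling fetch_links with the URL from the question should yield one pair per link on the page; parse_links can be tried on any HTML string without network access.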

Answer 2

Mie*_*ere 7

I think the "eyquem" way would be my choice too, but I like to use httplib2 instead of urllib. urllib2 is too low-level a library for this job.

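# Python 2 code (uses the print statement)
import re
import httplib2

# Capture the link text of every <DT><a href="..."> entry
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
# request() returns the response headers and the page body
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")

li = pat.findall(body)
print li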

Answer 3

use*_*312 6

Use urllib2 to fetch the page.
Use BeautifulSoup to parse the HTML (the page) and get what you want!
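
The two steps above could be sketched roughly as follows. This assumes Python 3 (where urllib2's functionality lives in urllib.request) and the third-party beautifulsoup4 package. The helper names extract_artists and fetch_artists are hypothetical, and the filtering rule (keep only links that sit directly inside a dt element) is an assumption based on the patterns used in the other answers, not something confirmed about the page.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_artists(html_text):
    """Return (name, href) pairs for links that sit directly inside a <dt>."""
    soup = BeautifulSoup(html_text, "html.parser")
    artists = []
    for a in soup.find_all("a", href=True):
        # Assumption: artist entries look like <dt><a href="...">Name</a>;
        # navigation links elsewhere on the page are skipped.
        if a.parent is not None and a.parent.name == "dt":
            artists.append((a.get_text(strip=True), a["href"]))
    return artists


def fetch_artists(url):
    """Download url and return its artist links (hypothetical helper)."""
    with urlopen(url) as response:
        # BeautifulSoup accepts raw bytes and works out the encoding itself
        return extract_artists(response.read())
```

Filtering on the parent element is only one option; if the real page marks up its artist list differently, the condition inside the loop is the part to change.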

Answer 4

小智 6

Check this out, my friend:

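import re
import urllib.request

# Capture the link text of every <DT><a href="..."> entry
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
url = 'http://www.infolanka.com/miyuru_gee/art/art.html'

# Download the page and decode the bytes (assumes the page is valid UTF-8)
sock = urllib.request.urlopen(url).read().decode("utf-8")

li = pat.findall(sock)
print(li)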
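
The pattern above only matches entries written exactly as <DT><a href="...">, in that letter case and with no extra whitespace or attributes. A slightly more tolerant variant might look like the sketch below. The names ARTIST_LINK and find_artists are made up for this example, and whether the real page needs this tolerance is an assumption.

```python
import html
import re

# Case-insensitive variant of the pattern above; it also allows whitespace
# between <DT> and <a>, and other attributes around href.
ARTIST_LINK = re.compile(
    r'<dt>\s*<a\s[^>]*?href="([^"]+)"[^>]*>(.+?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def find_artists(page_text):
    """Return (name, href) pairs for every <DT><a href="..."> entry."""
    return [
        # Turn entities such as &amp; back into plain characters in the name
        (html.unescape(name).strip(), href)
        for href, name in ARTIST_LINK.findall(page_text)
    ]
```

Links that are not preceded by a DT tag, such as navigation links, are left out, which is one way to filter the unwanted links the question asks about.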


Archived:	14 years, 8 months ago
Views:	72438
Last activity:	8 years, 10 months ago