nak*_*iya 17 html python webclient
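I'm not very familiar with Python. I'm trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.
How do I retrieve the page? My two main questions are: which functions should I use, and how do I filter out the useless links from the page?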
Vin*_*cer 22
An example using urllib and lxml.html:
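# Python 2 (urllib.urlopen and the print statement are Python 2 only)
import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

# Print the text and target of every link on the page
for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")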
output >>
[('Aathma Liyanage', 'athma.html'),
('Abewardhana Balasuriya', 'abewardhana.html'),
('Aelian Thilakeratne', 'aelian_thi.html'),
('Ahamed Mohideen', 'ahamed.html'),
]
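The `//a` query above returns every link on the page, including navigation links. One way to keep only the artist entries is to accept just those `<a>` tags that immediately follow a `<dt>` tag. The sketch below is a Python 3 variant that swaps lxml for the standard library's `html.parser` so it runs without extra packages. The sample markup, `ArtistLinkParser`, and `extract_artist_links` are hypothetical names made up for illustration, and the `<dt><a href="...">` structure is an assumption about the page. With lxml, the same restriction could probably be written as the XPath `//dt/a`, under the same assumption.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page's structure is an assumption.
SAMPLE_HTML = """
<a href="../index.html">Home</a>
<dl>
<dt><a href="athma.html">Aathma Liyanage</a>
<dt><a href="abewardhana.html">Abewardhana Balasuriya</a>
</dl>
<a href="mailto:someone@example.com">Contact</a>
"""


class ArtistLinkParser(HTMLParser):
    """Collect (name, href) pairs for <a> tags that directly follow a <dt>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._after_dt = False  # True right after a <dt> start tag
        self._href = None       # href of the <a> currently being read
        self._text = []         # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "dt":
            self._after_dt = True
        elif tag == "a" and self._after_dt:
            self._href = dict(attrs).get("href")
            self._text = []
            self._after_dt = False
        else:
            self._after_dt = False

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
        elif data.strip():
            # Real text between <dt> and <a> means the link is not adjacent.
            self._after_dt = False

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_artist_links(html_text):
    """Return a list of (name, href) tuples for links that follow a <dt>."""
    parser = ArtistLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


print(extract_artist_links(SAMPLE_HTML))
```

To run this against the live page, the downloaded and decoded HTML text would be passed to `extract_artist_links` in place of `SAMPLE_HTML`.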
I think the "eyquem" approach would be my choice too, but I prefer to use httplib2 instead of urllib; urllib2 is too low-level for this job.
# Python 2
import httplib2
import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")
li = pat.findall(body)
print li
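To see why the `<DT><a href=...>` pattern filters out unrelated links, here is a small Python 3 sketch that applies it to an inline sample instead of the live page. The sample markup and the `artist_names` helper are hypothetical, and `re.IGNORECASE` is an addition that is not in the snippet above; it is there so lowercase `<dt>` tags would also match.

```python
import re

# Same pattern as above, written as a raw string.
# re.IGNORECASE is an assumption added here, not part of the original answer.
ARTIST_LINK = re.compile(r'<DT><a href="[^"]+">(.+?)</a>', re.IGNORECASE)

# Hypothetical sample markup; the real page's structure is an assumption.
SAMPLE_HTML = (
    '<a href="../index.html">Home</a>\n'
    '<DL>\n'
    '<DT><a href="athma.html">Aathma Liyanage</a>\n'
    '<DT><a href="ahamed.html">Ahamed Mohideen</a>\n'
    '</DL>\n'
)


def artist_names(html_text):
    """Return the link text of every <a> that directly follows a <DT>."""
    return ARTIST_LINK.findall(html_text)


# The "Home" link has no <DT> in front of it, so it is not matched.
print(artist_names(SAMPLE_HTML))
```

A regular expression like this depends on the exact markup, so it is likely to break if the page adds whitespace or attributes between `<DT>` and `<a>`.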
小智 6
Check this out, my friend:
import urllib.request
import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.request.urlopen(url).read().decode("utf-8")
li = pat.findall(sock)
print(li)