How to extract PDF links from a website with a Python script

use*_*358 2 python hyperlink web

I often need to download PDFs from websites, but sometimes they are not all on the same page. The site splits the links across paginated pages, and I have to click through every page to collect them.

I am learning Python, and I would like to write a script where I can supply a web URL and it extracts the PDF links from that website.

I am new to Python, so can anyone give me directions on how I can do this?

sam*_*ias 7

This is pretty simple with urllib2, urlparse and lxml. Since you're new to Python, I've commented things a bit more verbosely:

# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

The result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
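Note that the answer above is Python 2 (urllib2, urlparse, and the print statement no longer exist in Python 3). As a rough Python 3 sketch of the same idea, here is a version that uses only the standard library (html.parser instead of lxml, so no regex XPath); the sample HTML and the function name `extract_pdf_links` are my own inventions for illustration:

```python
# Python 3 sketch: collect <a> hrefs ending in ".pdf" using only the stdlib.
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that end in ".pdf" (case-insensitive)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.lower().endswith('.pdf'):
            # resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Demo on an invented snippet; in practice you would fetch the page with
# urllib.request.urlopen(base_url).read().decode() first.
sample = '<a href="files/Fund.pdf">Fund</a> <a href="/about.html">About</a>'
print(extract_pdf_links(sample, 'http://www.renderx.com/demos/examples.html'))
# prints ['http://www.renderx.com/demos/files/Fund.pdf']
```

For the paginated case mentioned in the question, you could call `extract_pdf_links` once per page URL in a loop and accumulate the results.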