Downloading HTML in Python?

Question

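I'm trying to download the HTML of a page that is requested through a JavaScript action when you click a link in the browser. I can download the first page because it has a regular URL: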

http://www.locationary.com/stats/hotzone.jsp?hz=1

But at the bottom of the page there are links that are numbers (1 to 10). If you click one, it goes to, for example, page 2:

http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2

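When I put that URL into my program and try to download the HTML, I get the HTML of a different page on the site; I think it is the home page.

How can I get the HTML for a URL like this one, which relies on JavaScript, and for cases where there is no specific URL at all?

Thanks.

Code: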

import urllib
import urllib2
import cookielib
import re

# Page to download (left blank in the original post).
URL = ''

def load(url):
    # First login request: used only to read the session cookie from the response headers.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()

    # Pull the cookie value out of the raw headers.
    # Assumes the Set-Cookie header is the sixth header line.
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))

    # Second login request: same as the first, but sends the captured cookie explicitly.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)

    # Fetch the target page with the logged-in opener and print its HTML.
    page = opener.open(url).read()
    print page

load(URL)

Answer 1

Pau*_*ine 1

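Selenium WebDriver, part of the Selenium tool suite, drives a standard browser to retrieve the HTML (its main purpose is test automation for web applications), so it is well suited to scraping JavaScript-heavy applications. It has good Python bindings.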
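As a rough illustration of that idea, here is a minimal sketch in Python 3. It assumes the pagination links can be located by their visible text ("2", "3", ...), which the question suggests but does not confirm. The names `fetch_page_html`, `FIRST_PAGE`, `ready`, and `main` are hypothetical helpers invented for this example, and the login step from the question is left out.

```python
# Sketch only: load the first page in a real browser, click a numbered
# pagination link, and return the HTML the browser ends up with.

FIRST_PAGE = "http://www.locationary.com/stats/hotzone.jsp?hz=1"  # URL from the question


def fetch_page_html(driver, start_url, link_text, ready=None):
    """Open start_url, click the link whose text is link_text, return the page HTML.

    `driver` is any Selenium WebDriver instance. `ready` is an optional
    callable that receives the driver and blocks until the page is ready
    (hypothetical hook for an explicit wait).
    """
    driver.get(start_url)
    # "link text" is the string value of selenium's By.LINK_TEXT locator.
    driver.find_element("link text", link_text).click()
    if ready is not None:
        ready(driver)
    return driver.page_source


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    def wait_for_load(d):
        # Assumption: waiting for document.readyState is enough for this site.
        WebDriverWait(d, 10).until(
            lambda drv: drv.execute_script("return document.readyState") == "complete"
        )

    driver = webdriver.Firefox()
    try:
        html = fetch_page_html(driver, FIRST_PAGE, "2", ready=wait_for_load)
        print(html[:500])
    finally:
        driver.quit()
```

Calling `main()` requires Selenium and Firefox to be installed. Because the driver is passed in as an argument, the click-then-read logic can also be checked with a stand-in object instead of a real browser.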

I tend to use Selenium to grab the page source after all the AJAX calls have fired, and then parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).
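The parsing half of that workflow might look like the sketch below. It assumes the `beautifulsoup4` package is installed and that the HTML string has already been obtained (for example from a WebDriver's `page_source`). The `SAMPLE` markup, its `href` values, and the `numbered_links` helper are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately sloppy markup: unquoted attribute,
# unclosed <p> and <div>, missing </html>.
SAMPLE = """
<html><body>
<p>Hot zone listing
<div class=pager>
<a href="?inPageNumber=1">1</a>
<a href="?inPageNumber=2"> 2 </a>
<a href="/about">About</a>
</body>
"""


def numbered_links(html):
    """Return (page_number, href) pairs for links whose text is a number."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        if text.isdigit():
            links.append((int(text), a["href"]))
    return links


print(numbered_links(SAMPLE))
```

Only the numeric links are kept, so navigation links such as "About" are skipped even though the surrounding markup is not well formed.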

Archived:	13 years, 3 months ago
Views:	511
Last activity:	13 years, 3 months ago