How to scrape a website that requires login first, using Python

First of all, I think it's worth mentioning that I know there are many similar questions, but none of them worked for me...

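I'm new to Python, HTML, and web scraping. I'm trying to scrape user information from a website that requires logging in first. In my tests I use scraping my own email settings from GitHub as the example. The main page is "https://github.com/login" and the target page is "https://github.com/settings/emails".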

Here is a list of the methods I've tried:

##################################### Method 1
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)


br.addheaders = [('User-agent', 'Chrome')]

# The site we will navigate into, handling its session
br.open('https://github.com/login')

for f in br.forms():
    print f

br.select_form(nr=0)

# User credentials
br.form['login'] = 'myusername'
br.form['password'] = 'mypwd'

# Login
br.submit()

# The URL needs an explicit scheme; without "https://" mechanize cannot resolve it
br.open('https://github.com/settings/emails').read()


################ Method 2
import urllib, urllib2, cookielib …
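
For reference, the flow both methods are aiming at (fetch the login page, submit the form together with its hidden fields, then request the target page with the same cookies) can be sketched roughly as below. This is only a minimal sketch, written for Python 3's standard library rather than the Python 2 modules used above. The helper names `hidden_fields`, `build_opener_with_cookies`, and `login_and_fetch` are hypothetical, and `post_url` is assumed to be whatever the login form's `action` attribute points to. The field names `login` and `password` are taken from Method 1. Whether a given site accepts a login like this depends on the site.

```python
# Sketch only: Python 3 standard library, no third-party packages.
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields (e.g. a CSRF token) found in `html`.

    Simplification: fields from all forms on the page are merged.
    """
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_opener_with_cookies():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener


def login_and_fetch(login_url, post_url, target_url, username, password):
    """Hypothetical helper: log in, then fetch a page behind the login."""
    opener = build_opener_with_cookies()

    # 1. GET the login page to receive the session cookie and hidden fields.
    with opener.open(login_url) as resp:
        page = resp.read().decode("utf-8", errors="replace")

    # 2. POST the credentials together with the hidden fields.
    form = hidden_fields(page)
    form["login"] = username      # field name taken from Method 1
    form["password"] = password   # field name taken from Method 1
    data = urllib.parse.urlencode(form).encode("utf-8")
    with opener.open(post_url, data=data) as resp:
        resp.read()

    # 3. GET the target page; the opener resends the stored cookies.
    with opener.open(target_url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The point of the sketch is that one opener (and so one cookie jar) is used for all three requests, and that the hidden inputs of the login form are sent back unchanged. Nothing runs on import; `login_and_fetch` only touches the network when called with real URLs and credentials.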

python cookies authorization http scraper

use*_*451

2013 11-18

31 votes
3 answers
70k views