How can I scrape a website that requires login using Python and BeautifulSoup?

use*_*486 46 python beautifulsoup web-scraping

If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the BeautifulSoup4 library? Below is what I do for websites that do not require login.

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)

How should the code be changed to accommodate login? Suppose the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php

4d4*_*d4c 58

You can use mechanize:

import mechanize
from bs4 import BeautifulSoup
import urllib2 
import cookielib

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print br.response().read()

Or urllib - Logging in to a website using urllib2
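For reference, a minimal urllib2 + cookielib sketch of that approach (the login URL and the username/password field names here are assumptions; check the site's login form for the real action URL and input names):

import urllib
import urllib2
import cookielib

# keep cookies in a jar so the session survives past the login request
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# placeholder URL and field names - inspect the site's <form> for the real ones
payload = urllib.urlencode({'username': 'username', 'password': 'password'})
response = opener.open('https://example.com/login', payload)
print response.read()

# further requests through the same opener carry the session cookies
print opener.open('https://example.com/index.php').read()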

  • @DanS. This is quite late... but for anyone else who sees this: yes, the first username is the form field name; the second username is the actual username you type into that form field. (3 upvotes)
  • This answer is correct for Python 2; for Python 3 I would recommend mechanicalsoup https://mechanicalsoup.readthedocs.io/en/stable/ (see the sketch after these comments). (3 upvotes)
  • @JérômeB That is not a very helpful comment, since you haven't provided an answer here. I assume you are referring to this page https://mechanicalsoup.readthedocs.io/en/stable/tutorial.html?highlight=login#a-more-complete-example-logging-in-into-github ... (2 upvotes)
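Following up on the MechanicalSoup suggestion above, a minimal Python 3 sketch of the same flow (the form index and the username/password field names are assumptions, mirroring the mechanize example):

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://id.arduino.cc/auth/login/")
browser.select_form(nr=0)          # first form on the page, as in the mechanize example
browser["username"] = "username"   # field names are assumptions - inspect the form
browser["password"] = "password"
response = browser.submit_selected()
print(response.text)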

Ade*_*lin 19

From my point of view, there is a simpler way to get there without selenium, mechanize, or other third-party tools, albeit it is only semi-automated.

Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials, and the same identity is then used for every subsequent interaction; it is stored in cookies and headers for a period of time.

All you need to do is use those same cookies and headers when you make your HTTP requests, and you're in.

To replicate that, follow these steps:

  1. In your browser, open the developer tools
  2. Go to the site and log in
  3. Once logged in, go to the Network tab, and then refresh the page.
     At this point you should see a list of requests, the top one being the actual site - that one will be our focus, because it contains the identity data we can use to scrape the site with Python and BeautifulSoup
  4. Right-click the site request (the top one), hover over copy, and then copy as cURL,
     like this:

     (screenshot: right-click the request, then Copy > Copy as cURL)

  5. Then go to this site, which converts cURL into python requests: https://curl.trillworks.com/
  6. Take the python code and use the generated cookies and headers to proceed with the scraping, as sketched below
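To illustrate where the steps land, here is a minimal sketch of what the generated requests code looks like (every cookie and header value below is a placeholder; paste the ones generated from your own cURL):

import requests
from bs4 import BeautifulSoup

# placeholders - curl.trillworks.com fills these in from your copied cURL
cookies = {'session_id': 'PASTE_YOUR_SESSION_COOKIE_HERE'}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/',
}

# the request carries your logged-in identity, so protected pages are reachable
response = requests.get('https://example.com/members', cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)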

  • Thanks for the great advice. Step 3 can vary if the login page request gets redirected (status code 30x); in that case the request can't be seen in the Network tab after login. I would suggest instead: a) get the address (URL) from the "action" attribute of the login "<form>" and substitute it in the cURL, or b) open the Network tab; wait until the login page and all resources have loaded; fill in the login form; clear the Network tab; submit the login form - the first request in the Network tab will then contain the desired address (URL). (3 upvotes)
  • Thank you so much for your answer! I'm currently using it to scrape websites for apartment rentals. Unfortunately, I'm not quite sure how to apply this scraping technique quickly, so I asked a question about it. Do you have any thoughts on it? /sf/ask/4698788531/ (2 upvotes)

Pla*_*tta 7

If you go with selenium, you can do something like the following:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

# If you want to open Chrome
driver = webdriver.Chrome()
# If you want to open Firefox, use this instead:
# driver = webdriver.Firefox()

# load the login page first (the URL and element ids below depend on the target site)
driver.get("http://example.com/login")

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

However, if you insist on using only BeautifulSoup, you can do it with a library like requests or urllib. Basically, all you have to do is POST the login data as a payload to the login URL.

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # post through the session so the login cookies are kept for the next request
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)


小智 7

Since no Python version was specified, here is my take on it for Python 3, without any external libraries (StackOverflow). After logging in, use BeautifulSoup as usual, or any other kind of scraping.

Likewise, my script is on my GitHub here.

The entire script is replicated below, as per StackOverflow guidelines:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought, this is automatically treated as a POST request,
    #   presumably because of the byte-encoded data field, so you don't need to spell it out like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()


小智 5

You can use selenium to log in and retrieve the page source, which you can then pass to Beautiful Soup to extract the data you want, as sketched below.
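A minimal sketch of that idea (the login URL and element ids are assumptions, as in the selenium answer above):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/login')  # hypothetical login page
driver.find_element_by_id('username').send_keys('your_username')
driver.find_element_by_id('password').send_keys('your_password')
driver.find_element_by_id('submit_btn').click()

# hand the logged-in page source to Beautiful Soup for parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)
driver.quit()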

  • Consider providing more details in your answer. (19 upvotes)