python beautifulsoup web-scraping
If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python and the beautifulsoup4 library? Here is what I do for websites that do not require a login.
from bs4 import BeautifulSoup
import urllib2

# fetch the page (Python 2 / urllib2) and parse it
url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
How do I change the code to handle a login? Assume the website I want to scrape is a forum that requires logging in, for example http://forum.arduino.cc/index.php
You can use mechanize:
import mechanize
import cookielib
from bs4 import BeautifulSoup

# a browser with a cookie jar, so the login cookie is kept for later requests
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

# open the login page, fill in the first form on it and submit
br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()

# the logged-in page can now be read and handed to BeautifulSoup (Python 2 syntax)
html = br.response().read()
print html
soup = BeautifulSoup(html)
Or with urllib2 - see "Login to website using urllib2".
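A minimal Python 2 sketch of that urllib2 route, assuming a plain form login; the URLs and the username/password field names here are placeholders that depend on the site:

import urllib
import urllib2
import cookielib

# cookie-aware opener so the session cookie set at login is reused afterwards
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# POST the credentials to the (assumed) login endpoint
payload = urllib.urlencode({'username': 'username', 'password': 'password'})
opener.open('http://example.com/login', payload)

# subsequent requests through the same opener are authenticated
print opener.open('http://example.com/members').read()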
From my point of view, there is a simpler way that gets you there without selenium or mechanize or other third-party tools, although it is semi-automated.
Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials, and the same identity is then used for every other interaction; it is stored in cookies and headers for a while.
All you need to do is send those same cookies and headers with your HTTP requests, and you'll be in.
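For instance, a minimal sketch of reusing them with requests; the cookie and header values are placeholders you would copy from your own logged-in browser session, and the URLs are made up:

import requests
from bs4 import BeautifulSoup

# placeholder values - copy the real ones from a logged-in browser session
cookies = {'session_id': 'paste-your-session-cookie-here'}
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}

response = requests.get('http://example.com/members-only',
                        cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)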
To replicate that, follow these steps:

1. Log in to the site in your browser and open the developer tools' Network tab.
2. Find the request for the page you want to scrape, right-click it, choose Copy, then Copy as cURL.
3. Take the cookies and headers from that cURL command and use them, as in the sketch above, to continue scraping.

If you choose Selenium instead, you can do something like the following:
from selenium import webdriver

# use Chrome ...
driver = webdriver.Chrome()
# ... or, if you prefer, Firefox
# driver = webdriver.Firefox()

# open the login page first (placeholder URL - use the site's real login page)
driver.get("http://example.com/login")

# the element ids below ("username", "password", "submit_btn") depend on the site
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()
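After the login completes you can hand the rendered page to BeautifulSoup, for example (a short continuation of the sketch above):

from bs4 import BeautifulSoup

# driver is the logged-in WebDriver from the snippet above
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)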
However, if you insist on using only BeautifulSoup, you can do it with a library such as requests or urllib. Basically, all you have to do is POST your login data as a payload to the login URL.
import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # log in through the session so the cookies are kept for later requests
    response = s.post(login_url, data=data)
    print(response.text)

    # any further request made with the same session is authenticated
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
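Many login forms also expect a hidden CSRF token in the payload; a hedged sketch of grabbing it with BeautifulSoup before posting (the _token field name and the URLs here are assumptions that vary per site):

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'

with requests.Session() as s:
    # fetch the login page and pull the hidden token out of the form
    login_page = s.get(login_url)
    soup = BeautifulSoup(login_page.text, 'html.parser')
    token = soup.find('input', {'name': '_token'})['value']  # assumed field name

    payload = {'username': 'your_username',
               'password': 'your_password',
               '_token': token}
    s.post(login_url, data=payload)

    index_page = s.get('http://example.com')
    print(BeautifulSoup(index_page.text, 'html.parser').title)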
Since no Python version was specified, here is my take for Python 3, without any external libraries. Log in with it, then use BeautifulSoup, or any other kind of scraping, as usual.
The whole script is reproduced below:
# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar
def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes the URL that's found inside form action='.....'
    # adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will maybe need to tweak a couple of things regarding the "token" logic
    # (can be _token or token or _token_ or secret ... etc.)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
               "Host": base_url,
               "Origin": https_base_url,
               "Referer": https_base_url}

    # initiate the cookie jar (using: http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get the login page and parse out the token
    # (using: urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for the token, eg. on my page it was something like this:
    # <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar
    # but I'm a newb, so bear with me
    html = contents.decode("utf-8")

    # text just before the start and just after the end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'

    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)

    # and the text between them is our token, store it for the second step, the actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    # that includes the token we scraped earlier, as that's usually in hidden fields
    # make sure the left side is from the "name" attributes of the form,
    # and the right side is what you want to post as the "value"
    # and for hidden fields make sure you replicate the expected answer,
    # eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token': token,
        # 'name': 'value',  # make sure this is the format of all additional fields!
        'login': username,
        'password': password
    }

    # now we prepare all we need for the login
    # data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')

    # and put the URL + encoded data + correct headers into our POST request
    # btw, despite what I thought, it is automatically treated as POST,
    # I guess because of the byte-encoded data field, so you don't need to say it like this:
    # urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element on the page that's secure behind the login
    # we use a particular string we know only occurs after login,
    # like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)

    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
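As the comments in the script note, the string-slicing token lookup could be done with a regular expression instead; a small sketch under the same assumptions (html is the decoded login page from the script above, and _token is the assumed field name from the example form):

import re

# html is the decoded login page; fall back to an empty token if nothing matches
match = re.search(r'name="_token"\s+value="([^"]+)"', html)
token = match.group(1) if match else ''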