How to bypass a cookie agreement page when web scraping with Python?

Question

Vin*_*que 4 python web-scraping python-requests

I keep running into a cookie agreement page...

What I'm doing:

import requests
from bs4 import BeautifulSoup

url = "https://stockhouse.com/community/bullboards/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)


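This returns the HTML of the cookie agreement page. What I'm looking for is a way to get past this page and scrape the content of the actual page, as it appears once the cookies have been accepted...

I tried the code from this question: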

cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)


But I still get the HTML of the cookie page.

Note: I got this working with Selenium, but Selenium is a very inefficient last resort...

Answer 1

And*_*ely 6

For this site, it is enough to send a "dummy" privacy-policy cookie:

import requests
from bs4 import BeautifulSoup

url = "https://stockhouse.com/community/bullboards/"

cookies = {
    'privacy-policy': '1,XXXXXXXXXXXXXXXXXXXXXX'
}

r = requests.get(url, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")

for h3 in soup.select('h3'):
    print(h3.get_text(strip=True))


This prints the titles:

Perfect timing: Mach offer no good as per AMF
'Explosive' Move Up Next Week"
Repsol/ Tullow
Assessment
$5.96
Possible Deal?
Massive Investor(s) Buys Over 1 Million JE Shares Last Close
This CEO is really on the ball , right flubber
slow bb
Situation
Loadddddd
Numerology of the number 36
TIMBERRRR!!.. it will go down fast to $1.50
Employees in the know do the right thing Whistelblow
News finally
Will be bought out...halt coming
Green today
Somebody is buying
re re :350 mil is not enough
And Trump fk up another day
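
If several pages need to be fetched, the same dummy cookie can be attached to a `requests.Session` so that it is sent with every request. The following is only a minimal sketch under a few assumptions: the helper names `make_session`, `extract_titles` and `scrape_titles` are hypothetical, the `User-Agent` header is borrowed from the question rather than known to be required, and the site may have changed which cookie it checks since this answer was written.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://stockhouse.com/community/bullboards/"


def make_session(cookie_value="1,XXXXXXXXXXXXXXXXXXXXXX"):
    """Build a Session that sends the dummy privacy-policy cookie each time."""
    session = requests.Session()
    # Assumption: a browser-like User-Agent, as tried in the question.
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    # Same cookie name and placeholder value as in the answer above.
    session.cookies.set("privacy-policy", cookie_value)
    return session


def extract_titles(html):
    """Return the stripped text of every <h3> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.select("h3")]


def scrape_titles(url=URL):
    """Fetch the page through the prepared session and return its <h3> titles."""
    session = make_session()
    response = session.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_titles(response.content)
```

Calling `scrape_titles()` performs the network request. If it still returns content from the agreement page, the cookie name or value the site expects has probably changed and would need to be checked again in the browser's developer tools.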


Archived:	6 years, 5 months ago
Viewed:	6232 times
Last active:	4 years, 9 months ago