rob*_*txt 8 python web-scraping python-3.x python-requests
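I've created a script to scrape the names of the different books listed under the "Customers who bought this item also bought" section of these pages. Clicking the right-arrow button reveals all the related books. I've used two different book links in the script to see how it behaves.
The payload I send in the POST request is hard-coded and meant for product_links. The payload seems to be available in the page source, but I can't find the right way to use it automatically. A few of the ids in the payload may differ when I use another book link, so a hard-coded payload doesn't seem like a good idea.
I've tried with: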
import requests
from bs4 import BeautifulSoup
product_links = [
    'https://www.amazon.com/Essential-Keto-Diet-Beginners-2019/dp/1099697018/',
    'https://www.amazon.com/Keto-Cookbook-Beginners-Low-Carb-Homemade/dp/B08QFBMSFT/'
]
url = 'https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems'
payload = {"aCarouselOptions":"{\"ajax\":{\"id_list\":[\"{\\\"id\\\":\\\"B07NYZJX2L\\\"}\",\"{\\\"id\\\":\\\"1939754445\\\"}\",\"{\\\"id\\\":\\\"1792145454\\\"}\",\"{\\\"id\\\":\\\"1073560988\\\"}\",\"{\\\"id\\\":\\\"1119578922\\\"}\",\"{\\\"id\\\":\\\"B083K5RRSG\\\"}\",\"{\\\"id\\\":\\\"B07SPSXHZ8\\\"}\",\"{\\\"id\\\":\\\"B08GG2RL1D\\\"}\",\"{\\\"id\\\":\\\"1507212305\\\"}\",\"{\\\"id\\\":\\\"B08QFBMSFT\\\"}\",\"{\\\"id\\\":\\\"164152247X\\\"}\",\"{\\\"id\\\":\\\"1673455980\\\"}\",\"{\\\"id\\\":\\\"B084DD8WHP\\\"}\",\"{\\\"id\\\":\\\"1706342667\\\"}\",\"{\\\"id\\\":\\\"1628603135\\\"}\",\"{\\\"id\\\":\\\"B08NZV2Z4N\\\"}\",\"{\\\"id\\\":\\\"1942411294\\\"}\",\"{\\\"id\\\":\\\"1507209924\\\"}\",\"{\\\"id\\\":\\\"1641520434\\\"}\",\"{\\\"id\\\":\\\"B084Z7627Q\\\"}\",\"{\\\"id\\\":\\\"B08NRXFZ98\\\"}\",\"{\\\"id\\\":\\\"1623159326\\\"}\",\"{\\\"id\\\":\\\"B0827DHLR6\\\"}\",\"{\\\"id\\\":\\\"B08TL5W56Z\\\"}\",\"{\\\"id\\\":\\\"1941169171\\\"}\",\"{\\\"id\\\":\\\"1645670945\\\"}\",\"{\\\"id\\\":\\\"B08GLSSNKF\\\"}\",\"{\\\"id\\\":\\\"B08RR4RJHB\\\"}\",\"{\\\"id\\\":\\\"B07WRQ4CF4\\\"}\",\"{\\\"id\\\":\\\"B08Y49Z3V1\\\"}\",\"{\\\"id\\\":\\\"B08LNX32ZL\\\"}\",\"{\\\"id\\\":\\\"1250621097\\\"}\",\"{\\\"id\\\":\\\"1628600071\\\"}\",\"{\\\"id\\\":\\\"1646115511\\\"}\",\"{\\\"id\\\":\\\"1705799507\\\"}\",\"{\\\"id\\\":\\\"B08XZCM2P4\\\"}\",\"{\\\"id\\\":\\\"1072855267\\\"}\",\"{\\\"id\\\":\\\"B08VCMWPB9\\\"}\",\"{\\\"id\\\":\\\"1623159229\\\"}\",\"{\\\"id\\\":\\\"B08KH2J3FM\\\"}\",\"{\\\"id\\\":\\\"B08D54RBGP\\\"}\",\"{\\\"id\\\":\\\"1507212992\\\"}\",\"{\\\"id\\\":\\\"1635653894\\\"}\",\"{\\\"id\\\":\\\"B01MUB7BUV\\\"}\",\"{\\\"id\\\":\\\"0358120861\\\"}\",\"{\\\"id\\\":\\\"B08FV23D3F\\\"}\",\"{\\\"id\\\":\\\"B08FNMP9YY\\\"}\",\"{\\\"id\\\":\\\"1671590902\\\"}\",\"{\\\"id\\\":\\\"1641527692\\\"}\",\"{\\\"id\\\":\\\"1628603917\\\"}\",\"{\\\"id\\\":\\\"B07ZHPQBVZ\\\"}\",\"{\\\"id\\\":\\\"B08Y49Y63B\\\"}\",\"{\\\"id\\\":\\\"B08T2QRSN3\\\"}\",\"{\\\"id\\\":\\\"1729392164\\\"}\",\"{\\\"id\\\":\\\"B08T46R6XC\\\"}\",\"{\\\"id\\\":\\\"B08RRF5V1D\\\"}\",\"{\\\"id\\\":\\\"1592339727\\\"}\",\"{\\\"id\\\":\\\"1628602929\\\"}\",\"{\\\"id\\\":\\\"1984857088\\\"}\",\"{\\\"id\\\":\\\"0316529583\\\"}\",\"{\\\"id\\\":\\\"1641524820\\\"}\",\"{\\\"id\\\":\\\"1628602635\\\"}\",\"{\\\"id\\\":\\\"B00GRIR87M\\\"}\",\"{\\\"id\\\":\\\"B08FBHN5H7\\\"}\",\"{\\\"id\\\":\\\"B06ZYSS7HS\\\"}\"]},\"autoAdjustHeightFreescroll\":true,\"first_item_flush_left\":false,\"initThreshold\":100,\"loadingThresholdPixels\":100,\"name\":\"p13n-sc-shoveler_n1in5tlbg2h\",\"nextRequestSize\":6,\"set_size\":65}","faceoutspecs":"{}","faceoutkataname":"GeneralFaceout","individuals":"0","language":"en-US","linkparameters":"{\"pd_rd_w\":\"eouzj\",\"pf_rd_p\":\"45451e33-456f-46b5-8f06-aedad504c3d0\",\"pf_rd_r\":\"6Q3MPZHQQ2ESWZND1K8T\",\"pd_rd_r\":\"e5e43c03-d78d-41d3-9064-87af93f9856b\",\"pd_rd_wg\":\"PdhmI\"}","marketplaceid":"ATVPDKIKX0DER","name":"p13n-sc-shoveler_n1in5tlbg2h","offset":"6","reftagprefix":"pd_sim","aDisplayStrategy":"swap","aTransitionStrategy":"swap","aAjaxStrategy":"promise","ids":["{\"id\":\"B07SPSXHZ8\"}","{\"id\":\"B08GG2RL1D\"}","{\"id\":\"1507212305\"}","{\"id\":\"B08QFBMSFT\"}","{\"id\":\"164152247X\"}","{\"id\":\"1673455980\"}","{\"id\":\"B084DD8WHP\"}","{\"id\":\"1706342667\"}","{\"id\":\"1628603135\"}"],"indexes":[6,7,8,9,10,11,12,13,14]}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # for product_link in product_links:
    s.headers['x-amz-acp-params'] = "tok=0DV5j8DDJsH8JQfdVFxJFD3p6AZraMOZTik-kgzNi08;ts=1619674837835;rid=ER1GSMM13VTETPS90K43;d1=251;d2=0;tpm=CGHBD;ref=rtpb"
    res = s.post(url, json=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("li.a-carousel-card-fragment > a.a-link-normal > div[data-rows]"):
        print(item.text)
How can I scrape all the books from the "customers who bought" section without a hard-coded payload?
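Everything you need to get the carousel data is already in the initial response you receive when you query the product URL.
You need to fetch the full product HTML, extract the carousel data from it, and reuse parts of that data to build the payload for the subsequent POST requests.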
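To illustrate the extraction step offline, here is a minimal sketch that uses only the standard library. The HTML snippet, the parser class and the helper name extract_carousel_ids are made up for illustration; only the data-a-carousel-options attribute and its ajax / id_list keys come from the payload shown in the question, and the real page markup may well differ.

```python
import html
import json
from html.parser import HTMLParser
from typing import List, Optional


class CarouselOptionsParser(HTMLParser):
    """Remember the first data-a-carousel-options attribute in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.options: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        if self.options is None:
            raw = dict(attrs).get("data-a-carousel-options")
            if raw:
                # HTMLParser has already unescaped the attribute value,
                # so it is plain JSON at this point.
                self.options = json.loads(raw)


def extract_carousel_ids(page_html: str) -> List[str]:
    """Return the bare ids stored in ajax.id_list (hypothetical helper)."""
    parser = CarouselOptionsParser()
    parser.feed(page_html)
    if parser.options is None:
        return []
    # Each id_list entry is itself a JSON string such as '{"id": "B07NYZJX2L"}'.
    return [json.loads(entry)["id"] for entry in parser.options["ajax"]["id_list"]]


# Made-up stand-in for the product page, built the way the attribute is
# assumed to appear in the HTML: JSON, entity-escaped inside the attribute.
_options = {
    "ajax": {"id_list": [json.dumps({"id": "B07NYZJX2L"}),
                         json.dumps({"id": "1939754445"})]},
    "set_size": 2,
}
SAMPLE_HTML = '<div data-a-carousel-options="{}"></div>'.format(
    html.escape(json.dumps(_options), quote=True)
)

print(extract_carousel_ids(SAMPLE_HTML))
```

The same idea carries over to BeautifulSoup: read the attribute, json.loads it, and take ajax.id_list from the result.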
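However, getting the product HTML is the hardest part, at least in my experience: if you request the HTML too often, Amazon will either block you or throw a CAPTCHA.
Using a proxy or a VPN helps. Switching between product URLs sometimes helps too.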
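One way to cope with blocks is to detect them and back off before retrying. The sketch below is only an assumption about how that could look: the marker strings, the status codes treated as "blocked", and the function names are all hypothetical and not taken from any Amazon documentation. The fetch callable is injected so the logic can be tried without touching the network.

```python
import time
from typing import Callable, Optional, Tuple

# Assumed hints that a block or CAPTCHA page came back; adjust to what you see.
BLOCK_MARKERS = ("validateCaptcha", "Enter the characters you see below")
BLOCK_STATUS_CODES = (429, 503)


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic check for a block or CAPTCHA response (hypothetical)."""
    return status_code in BLOCK_STATUS_CODES or any(m in body for m in BLOCK_MARKERS)


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns an unblocked body or retries run out."""
    for attempt in range(retries):
        status_code, body = fetch()
        if not looks_blocked(status_code, body):
            return body
        if attempt < retries - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

With requests, fetch could wrap session.get(product_url, headers=headers, proxies={"https": "http://127.0.0.1:8080"}) and return the status code and text; the proxy address here is a placeholder.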
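To sum up, the key is getting hold of the product HTML. AFAIK, the subsequent requests are easy to make and don't get throttled.
Here's how to get the data from the carousel: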
import json
import re
from typing import Iterator

import requests
from bs4 import BeautifulSoup


# The chunk is how many carousel items are going to be requested for;
# this can vary from 4 - 10 items, as on the web-page.
# Also, the other list is used as the indexes key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5) -> Iterator[tuple]:
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        yield tmp, [carousel_ids.index(item) for item in tmp]


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}
product_url = 'https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/'

with requests.Session() as session:
    # Getting the product HTML as it carries all the carousel data items
    session.get("https://www.amazon.com", headers=headers)
    page = session.get(product_url, headers=headers)

    # This is where the carousel data sits along with all the items needed to make
    # the following requests e.g. items, acp-params, linkparameters, marketplaceid etc.
    match = re.search(r"<!--CardsClient-->(.*)<input", page.text)
    if match is None:
        # No carousel markup, e.g. when a block or CAPTCHA page came back
        raise SystemExit("Carousel markup not found in the product HTML.")
    initial_soup = BeautifulSoup(match.group(1), "lxml").find_all("div")

    # Preparing all the details for subsequent requests to carousel_endpoint
    carousel_options = initial_soup[3]["data-a-carousel-options"]
    item_ids = json.loads(carousel_options)["ajax"]["id_list"]
    payload = {
        "aAjaxStrategy": "promise",
        "aCarouselOptions": carousel_options,
        "aDisplayStrategy": "swap",
        "aTransitionStrategy": "swap",
        "faceoutkataname": "GeneralFaceout",
        "faceoutspecs": "{}",
        "individuals": "0",
        "language": "en-US",
        "linkparameters": initial_soup[0]["data-acp-tracking"],
        "marketplaceid": initial_soup[3]["data-marketplaceid"],
        "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes but can be ignored
        "offset": "6",
        "reftagprefix": "pd_sim",
    }
    headers.update(
        {
            "x-amz-acp-params": initial_soup[0]["data-acp-params"],
            "x-requested-with": "XMLHttpRequest",
        }
    )

    # looping through the carousel data and performing requests
    carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"
    for ids, indexes in get_idx_and_indexes(item_ids):
        payload["ids"] = ids
        payload["indexes"] = indexes
        # The actual carousel data
        response = session.post(carousel_endpoint, json=payload, headers=headers)
        carousel = BeautifulSoup(response.text, "lxml").find_all("a")
        print("\n".join(a.getText() for a in carousel))
This should output:
Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Sellerin Computer Hacking
$24.00
Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42
Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Sellerin Linux Servers
$31.99
System Design Interview – An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Sellerin Bioinformatics
$24.99
Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35
Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Sellerin Computer Neural Networks
$32.93
Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45
Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24
TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26
Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Sellerin Computer Operating Systems Theory
$24.61
Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52
and so on ...
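For reference, this is how the ids and indexes pairs sent with each POST line up. It is a small standalone sketch with made-up ids; the function name chunk_with_indexes is hypothetical, and unlike the helper in the script it derives the positions from the slice offset instead of list.index, which is my own variation (it gives the same result for unique ids and stays correct if an id repeats).

```python
from typing import Iterator, List, Tuple


def chunk_with_indexes(
    ids: List[str], chunk: int = 5
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (batch of ids, positions of those ids in the full list)."""
    for start in range(0, len(ids), chunk):
        batch = ids[start:start + chunk]
        yield batch, list(range(start, start + len(batch)))


# Made-up ids standing in for the entries of ajax.id_list
sample_ids = [f"ID{n}" for n in range(7)]
batches = list(chunk_with_indexes(sample_ids, chunk=3))

for batch, positions in batches:
    # In the script these two values go into payload["ids"] and payload["indexes"]
    print(batch, positions)
```

The last batch is simply shorter when the list length is not a multiple of the chunk size.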