Can't scrape all the books from a section without a hardcoded payload

rob*_*txt 8 python web-scraping python-3.x python-requests

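I've created a script to scrape the names of different books under the "Customers who bought this item also bought" section of these pages. All the related books become available once you click the right-arrow button. I've used two different book links in the script to see how it behaves.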

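The payload I've used in the POST request is hardcoded and was taken from one of the product_links. The payload seems to be available in the page source, but I can't find the right way to use it automatically. Several IDs within the payload are likely to differ when I use another book link, so hardcoding the payload doesn't seem to be a good idea.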

I've tried with:

import requests
from bs4 import BeautifulSoup

product_links = [
    'https://www.amazon.com/Essential-Keto-Diet-Beginners-2019/dp/1099697018/',
    'https://www.amazon.com/Keto-Cookbook-Beginners-Low-Carb-Homemade/dp/B08QFBMSFT/'
]

url = 'https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems'
payload = {"aCarouselOptions":"{\"ajax\":{\"id_list\":[\"{\\\"id\\\":\\\"B07NYZJX2L\\\"}\",\"{\\\"id\\\":\\\"1939754445\\\"}\",\"{\\\"id\\\":\\\"1792145454\\\"}\",\"{\\\"id\\\":\\\"1073560988\\\"}\",\"{\\\"id\\\":\\\"1119578922\\\"}\",\"{\\\"id\\\":\\\"B083K5RRSG\\\"}\",\"{\\\"id\\\":\\\"B07SPSXHZ8\\\"}\",\"{\\\"id\\\":\\\"B08GG2RL1D\\\"}\",\"{\\\"id\\\":\\\"1507212305\\\"}\",\"{\\\"id\\\":\\\"B08QFBMSFT\\\"}\",\"{\\\"id\\\":\\\"164152247X\\\"}\",\"{\\\"id\\\":\\\"1673455980\\\"}\",\"{\\\"id\\\":\\\"B084DD8WHP\\\"}\",\"{\\\"id\\\":\\\"1706342667\\\"}\",\"{\\\"id\\\":\\\"1628603135\\\"}\",\"{\\\"id\\\":\\\"B08NZV2Z4N\\\"}\",\"{\\\"id\\\":\\\"1942411294\\\"}\",\"{\\\"id\\\":\\\"1507209924\\\"}\",\"{\\\"id\\\":\\\"1641520434\\\"}\",\"{\\\"id\\\":\\\"B084Z7627Q\\\"}\",\"{\\\"id\\\":\\\"B08NRXFZ98\\\"}\",\"{\\\"id\\\":\\\"1623159326\\\"}\",\"{\\\"id\\\":\\\"B0827DHLR6\\\"}\",\"{\\\"id\\\":\\\"B08TL5W56Z\\\"}\",\"{\\\"id\\\":\\\"1941169171\\\"}\",\"{\\\"id\\\":\\\"1645670945\\\"}\",\"{\\\"id\\\":\\\"B08GLSSNKF\\\"}\",\"{\\\"id\\\":\\\"B08RR4RJHB\\\"}\",\"{\\\"id\\\":\\\"B07WRQ4CF4\\\"}\",\"{\\\"id\\\":\\\"B08Y49Z3V1\\\"}\",\"{\\\"id\\\":\\\"B08LNX32ZL\\\"}\",\"{\\\"id\\\":\\\"1250621097\\\"}\",\"{\\\"id\\\":\\\"1628600071\\\"}\",\"{\\\"id\\\":\\\"1646115511\\\"}\",\"{\\\"id\\\":\\\"1705799507\\\"}\",\"{\\\"id\\\":\\\"B08XZCM2P4\\\"}\",\"{\\\"id\\\":\\\"1072855267\\\"}\",\"{\\\"id\\\":\\\"B08VCMWPB9\\\"}\",\"{\\\"id\\\":\\\"1623159229\\\"}\",\"{\\\"id\\\":\\\"B08KH2J3FM\\\"}\",\"{\\\"id\\\":\\\"B08D54RBGP\\\"}\",\"{\\\"id\\\":\\\"1507212992\\\"}\",\"{\\\"id\\\":\\\"1635653894\\\"}\",\"{\\\"id\\\":\\\"B01MUB7BUV\\\"}\",\"{\\\"id\\\":\\\"0358120861\\\"}\",\"{\\\"id\\\":\\\"B08FV23D3F\\\"}\",\"{\\\"id\\\":\\\"B08FNMP9YY\\\"}\",\"{\\\"id\\\":\\\"1671590902\\\"}\",\"{\\\"id\\\":\\\"1641527692\\\"}\",\"{\\\"id\\\":\\\"1628603917\\\"}\",\"{\\\"id\\\":\\\"B07ZHPQBVZ\\\"}\",\"{\\\"id\\\":\\\"B08Y49Y63B\\\"}\",\"{\\\"id\\\":\\\"B08T2QRSN3\\\"}\",\"{\\\"id\\\":\\\"1729392164\\\"}\",\
"{\\\"id\\\":\\\"B08T46R6XC\\\"}\",\"{\\\"id\\\":\\\"B08RRF5V1D\\\"}\",\"{\\\"id\\\":\\\"1592339727\\\"}\",\"{\\\"id\\\":\\\"1628602929\\\"}\",\"{\\\"id\\\":\\\"1984857088\\\"}\",\"{\\\"id\\\":\\\"0316529583\\\"}\",\"{\\\"id\\\":\\\"1641524820\\\"}\",\"{\\\"id\\\":\\\"1628602635\\\"}\",\"{\\\"id\\\":\\\"B00GRIR87M\\\"}\",\"{\\\"id\\\":\\\"B08FBHN5H7\\\"}\",\"{\\\"id\\\":\\\"B06ZYSS7HS\\\"}\"]},\"autoAdjustHeightFreescroll\":true,\"first_item_flush_left\":false,\"initThreshold\":100,\"loadingThresholdPixels\":100,\"name\":\"p13n-sc-shoveler_n1in5tlbg2h\",\"nextRequestSize\":6,\"set_size\":65}","faceoutspecs":"{}","faceoutkataname":"GeneralFaceout","individuals":"0","language":"en-US","linkparameters":"{\"pd_rd_w\":\"eouzj\",\"pf_rd_p\":\"45451e33-456f-46b5-8f06-aedad504c3d0\",\"pf_rd_r\":\"6Q3MPZHQQ2ESWZND1K8T\",\"pd_rd_r\":\"e5e43c03-d78d-41d3-9064-87af93f9856b\",\"pd_rd_wg\":\"PdhmI\"}","marketplaceid":"ATVPDKIKX0DER","name":"p13n-sc-shoveler_n1in5tlbg2h","offset":"6","reftagprefix":"pd_sim","aDisplayStrategy":"swap","aTransitionStrategy":"swap","aAjaxStrategy":"promise","ids":["{\"id\":\"B07SPSXHZ8\"}","{\"id\":\"B08GG2RL1D\"}","{\"id\":\"1507212305\"}","{\"id\":\"B08QFBMSFT\"}","{\"id\":\"164152247X\"}","{\"id\":\"1673455980\"}","{\"id\":\"B084DD8WHP\"}","{\"id\":\"1706342667\"}","{\"id\":\"1628603135\"}"],"indexes":[6,7,8,9,10,11,12,13,14]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # for product_link in product_links:
    s.headers['x-amz-acp-params'] = "tok=0DV5j8DDJsH8JQfdVFxJFD3p6AZraMOZTik-kgzNi08;ts=1619674837835;rid=ER1GSMM13VTETPS90K43;d1=251;d2=0;tpm=CGHBD;ref=rtpb"
    res = s.post(url,json=payload)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("li.a-carousel-card-fragment > a.a-link-normal > div[data-rows]"):
        print(item.text)

How can I scrape all the books from the "Customers who bought this item also bought" section without hardcoding the payload?

bad*_*ker 8

Everything you need to get the carousel data is already in the response to the initial request for the product URL.

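You need to fetch the full product HTML, extract the carousel data, and reuse parts of it to build a payload that can be used in the subsequent POST requests.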
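As a rough illustration of the extraction step, here is a minimal sketch that looks the carousel attributes up by name. It assumes the attributes (data-a-carousel-options, data-marketplaceid, data-acp-params, data-acp-tracking) sit on div elements and that the first carousel on the page is the one you want. SAMPLE_HTML and extract_carousel_parts are made-up names for illustration, and the sample markup is a tiny hypothetical stand-in for a real product page.

```python
import html
import json

from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for a product page (an assumption).
_options = json.dumps({
    "ajax": {"id_list": [json.dumps({"id": "B07NYZJX2L"}),
                         json.dumps({"id": "1939754445"})]},
    "set_size": 2,
})
_tracking = json.dumps({"pd_rd_w": "eouzj"})
SAMPLE_HTML = (
    f'<div data-acp-params="tok=abc;ts=1;rid=XYZ" '
    f'data-acp-tracking="{html.escape(_tracking)}">'
    f'<div data-a-carousel-options="{html.escape(_options)}" '
    f'data-marketplaceid="ATVPDKIKX0DER"></div>'
    "</div>"
)


def extract_carousel_parts(page_html: str) -> dict:
    """Collect the pieces of the page that the payload and headers reuse."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Assumption: the first matching elements belong to the wanted carousel.
    carousel = soup.select_one("div[data-a-carousel-options]")
    acp = soup.select_one("div[data-acp-params]")
    if carousel is None or acp is None:
        raise ValueError("carousel markup not found (blocked page or new layout?)")
    raw_options = carousel["data-a-carousel-options"]
    return {
        "options": raw_options,  # sent back verbatim as aCarouselOptions
        "id_list": json.loads(raw_options)["ajax"]["id_list"],
        "marketplaceid": carousel.get("data-marketplaceid"),
        "acp_params": acp["data-acp-params"],  # x-amz-acp-params header value
        "linkparameters": acp.get("data-acp-tracking"),
    }


parts = extract_carousel_parts(SAMPLE_HTML)
print(parts["marketplaceid"], len(parts["id_list"]))
```

Looking elements up by attribute rather than by position means a missing carousel raises a clear error instead of an IndexError or a wrong match.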

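However, getting the product HTML is the hardest part, at least in my experience, because Amazon will either block you or throw a CAPTCHA if you request the HTML too often.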

Using a proxy or a VPN helps. Swapping product URLs sometimes helps too.
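One way to notice a blocked response early is to check it before parsing and to wait longer between retries. The sketch below is only an illustration: the marker strings in BLOCK_MARKERS and the status codes treated as a block are guesses, not documented Amazon behaviour, and looks_blocked, backoff_delays and fetch_with_retries are hypothetical helper names.

```python
import time
from typing import Callable, Iterator, Tuple

# Assumed markers: text that may appear on a CAPTCHA or interstitial page.
BLOCK_MARKERS = ("captcha", "robot check")


def looks_blocked(status_code: int, body: str) -> bool:
    """Guess whether a response is a block page rather than product HTML."""
    if status_code in (429, 503):  # assumed throttling status codes
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def backoff_delays(attempts: int, base: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until the response no longer looks blocked."""
    for delay in backoff_delays(attempts):
        status, body = fetch()
        if not looks_blocked(status, body):
            return body
        sleep(delay)
    raise RuntimeError("still blocked after retries")
```

With requests, fetch could be a small function that calls session.get(product_url, headers=headers) and returns (response.status_code, response.text).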

To sum up, the key is getting the product HTML. AFAIK, the subsequent requests are easy to make and are not throttled.

Here is how to get the data from the carousel:

import json
import re

import requests
from bs4 import BeautifulSoup


# `chunk` is how many carousel items are requested per call; on the web page
# this varies between 4 and 10 items.
# The second list in each yielded pair is used as the "indexes" key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5):
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        # Positions come from the slice offset, so a repeated ID cannot map
        # to the wrong index (list.index only ever returns the first match).
        yield tmp, list(range(index, index + len(tmp)))


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

product_url = 'https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/'
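# Getting the product HTML as it carries all the carousel data items.
# The session is not wrapped in a `with` block because it is reused for the
# POST requests further down.
session = requests.Session()
# Visit the home page first so the session picks up cookies; the response
# itself is not used.
session.get("https://www.amazon.com", headers=headers)
page = session.get(product_url, headers=headers)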

# This is where the carousel data sits along with all the items needed to make
# the following requests e.g. items, acp-params, linkparameters, marketplaceid etc.
initial_soup = BeautifulSoup(
    re.search(r"<!--CardsClient-->(.*)<input", page.text).group(1),
    "lxml",
).find_all("div")

# Preparing all the details for subsequent requests to carousel_endpoint
item_ids = json.loads(initial_soup[3]["data-a-carousel-options"])["ajax"]["id_list"]
payload = {
    "aAjaxStrategy": "promise",
    "aCarouselOptions": initial_soup[3]["data-a-carousel-options"],
    "aDisplayStrategy": "swap",
    "aTransitionStrategy": "swap",
    "faceoutkataname": "GeneralFaceout",
    "faceoutspecs": "{}",
    "individuals": "0",
    "language": "en-US",
    "linkparameters": initial_soup[0]["data-acp-tracking"],
    "marketplaceid": initial_soup[3]["data-marketplaceid"],
    "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes but can be ignored
    "offset": "6",
    "reftagprefix": "pd_sim",
}

headers.update(
    {
        "x-amz-acp-params": initial_soup[0]["data-acp-params"],
        "x-requested-with": "XMLHttpRequest",
    }
)

# looping through the carousel data and performing requests
carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"
for ids, indexes in get_idx_and_indexes(item_ids):
    payload["ids"] = ids
    payload["indexes"] = indexes
    # The actual carousel data
    response = session.post(carousel_endpoint, json=payload, headers=headers)
    carousel = BeautifulSoup(response.text, "lxml").find_all("a")
    print("\n".join(a.getText() for a in carousel))

This should output:

Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Sellerin Computer Hacking
$24.00

Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42

Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Sellerin Linux Servers
$31.99

System Design Interview – An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Sellerin Bioinformatics
$24.99

Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35

Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Sellerin Computer Neural Networks
$32.93

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45

Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24

TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26

Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Sellerin Computer Operating Systems Theory
$24.61

Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52

and so on ...
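The batching used for the "ids" and "indexes" payload keys can be checked on its own, without touching the network. The sketch below is a stand-alone variant of the answer's get_idx_and_indexes helper that derives positions from the slice offset. chunk_with_indexes and sample_ids are made-up names, and the IDs are placeholders rather than real product IDs.

```python
from typing import Iterator, List, Tuple


def chunk_with_indexes(
    items: List[str], chunk: int = 5
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (batch, positions) pairs, `chunk` items at a time."""
    for start in range(0, len(items), chunk):
        batch = items[start:start + chunk]
        # Positions are taken from the offset, so duplicates stay correct.
        yield batch, list(range(start, start + len(batch)))


# Placeholder IDs standing in for the carousel's id_list entries.
sample_ids = [f"ID{n}" for n in range(7)]
batches = list(chunk_with_indexes(sample_ids, chunk=3))
for ids, indexes in batches:
    print(indexes, ids)
```

Each pair maps directly onto one POST request: the first element goes into payload["ids"] and the second into payload["indexes"].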