Intermittent BeautifulSoup failure when extracting ISBNs from Amazon book pages

Mig*_*ana 3 python beautifulsoup python-requests

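I am trying to collect some information about books available on Amazon, but I have hit a strange intermittent error that I can't make sense of. At first I thought Amazon was blocking my connection, but then I noticed that the request returns "200 OK" and the response contains the real HTML of the page.

Take this book as an example: https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110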

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = 'https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110/ref=sr_1_1?crid=2PPCQEJD706VY&dchild=1&keywords=books+bestsellers+2020+paperback&qid=1598132071&sprefix=book%2Caps%2C234&sr=8-1'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, features="lxml")

price = {}

# Run each selector once and reuse the result.
strike = soup.select("#buyBoxInner > ul > li > span > .a-text-strike")
offer = soup.select(".offer-price")[0].string

if strike:
    # A struck-through price means the offer price is a promotional one.
    price["regular_price"] = float(strike[0].string[1:].replace(",", "."))
    price["promo_price"] = float(offer[1:].replace(",", "."))
else:
    price["regular_price"] = float(offer[1:].replace(",", "."))
# The first character of the price string is the currency symbol.
price["currency"] = offer[0]

This part works fine: I get the regular price, the promotional price (if there is one), and even the currency. But when I do this:

isbn = soup.select("td.bucket > .content > ul > li")[4].contents[1].string.strip().replace("-", "")

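I get "IndexError: list index out of range". But when I step through the code in a debugger, the content is actually there!

Is this a BeautifulSoup bug? Is the response too long?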

And*_*ely 6

Amazon appears to return two versions of the page: one that contains <td class="bucket"> and one that uses several <span> tags instead. This script tries to extract the ISBN from both:

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = 'https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, features="lxml")

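# Each selector matches either span.a-text-bold or b, covering both page versions.
# Note: recent soupsieve releases deprecate :contains in favour of :-soup-contains.
isbn_10 = soup.select_one('span.a-text-bold:contains("ISBN-10"), b:contains("ISBN-10")').find_parent().text
isbn_13 = soup.select_one('span.a-text-bold:contains("ISBN-13"), b:contains("ISBN-13")').find_parent().text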

print(isbn_10.split(':')[-1].strip())
print(isbn_13.split(':')[-1].strip())

Prints:

0241985110
978-0241985113
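If you would rather get a missing value than an exception when neither layout matches, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_isbns and the two HTML fragments are my own simplified stand-ins for the two page versions, not Amazon's real markup, and the labels are matched with a Python regular expression so that it does not depend on which "contains" pseudo-class your soupsieve version supports. Hyphens are removed the same way the question does.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified fragments standing in for the two page versions.
OLD_LAYOUT = """
<td class="bucket"><div class="content"><ul>
<li><b>ISBN-10:</b> 0241985110</li>
<li><b>ISBN-13:</b> 978-0241985113</li>
</ul></div></td>
"""

NEW_LAYOUT = """
<ul>
<li><span class="a-list-item"><span class="a-text-bold">ISBN-10 : </span><span>0241985110</span></span></li>
<li><span class="a-list-item"><span class="a-text-bold">ISBN-13 : </span><span>978-0241985113</span></span></li>
</ul>
"""

ISBN_LABEL = re.compile(r"ISBN-1[03]")


def extract_isbns(html):
    """Return a dict such as {"ISBN-10": ..., "ISBN-13": ...}.

    A key is simply absent when its label is not found, so callers
    can use .get() instead of catching IndexError.
    """
    soup = BeautifulSoup(html, "html.parser")
    isbns = {}
    # Match <b> or <span> tags whose only content is an ISBN label.
    for label in soup.find_all(["b", "span"], string=ISBN_LABEL):
        name = ISBN_LABEL.search(label.string).group(0)
        # The value sits next to the label, inside the same parent tag.
        text = label.find_parent().get_text(" ", strip=True)
        value = text.split(":")[-1].strip().replace("-", "")
        isbns.setdefault(name, value)
    return isbns


if __name__ == "__main__":
    for fragment in (OLD_LAYOUT, NEW_LAYOUT):
        found = extract_isbns(fragment)
        print(found.get("ISBN-10"), found.get("ISBN-13"))
```

With a live page you would pass response.content instead of the sample fragments. If the result is empty, it is worth saving the HTML to a file and checking which version of the page you actually received.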