BeautifulSoup sometimes raises exceptions

Question


use*_*293 5 python beautifulsoup web-crawler html-parsing web-scraping

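Strangely, the BeautifulSoup object sometimes returns the data I need, but at other times I get errors such as `list index out of range` or `'NoneType' object has no attribute 'findNext'`. The data I am after is nested inside other elements.

Here is the code: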

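import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

# find() returns None when no text node equals 'Socket' exactly
a = soup.find(text=('Socket')).find_next('dd').string

print(a)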


Answer 1

ale*_*cxe 3

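The actual problem is that the cell text is not always exactly `Socket`; sometimes it is surrounded by tabs or spaces. Instead of checking for an exact `text` match, pass a compiled regular expression pattern: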

import re

soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)


This always prints `1150`.
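The difference between the two lookups can be reproduced offline. The following is a minimal sketch, assuming a made-up HTML fragment (not the real page source) in which the label is padded with whitespace. It uses the `string` argument, the newer name for `text` in recent bs4 releases, and the built-in `html.parser`:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: the label text node is padded with whitespace.
HTML = """
<dl>
  <dt>
      Socket
  </dt>
  <dd> 1150 </dd>
</dl>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Exact match: the text node is "\n      Socket\n  ", not "Socket".
exact = soup.find(string="Socket")

# Regex match: re.compile("Socket") is searched within the text node.
fuzzy = soup.find(string=re.compile("Socket"))
value = fuzzy.find_next("dd").get_text(strip=True)

print(exact)  # None
print(value)  # 1150
```

Calling `.find_next()` on the `None` returned by the exact match is what produces the `'NoneType' object has no attribute ...` error.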

To explain why it only fails "sometimes" (thanks to @carpetsmoker for the initial suggestion in the comments):

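If you open the page, then clear your cookies and refresh, you may see two different layouts of the same page.

The blocks on the page are arranged differently, so the same URL serves two different layouts with different HTML source. What you are seeing is A/B testing: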

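In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants, leading to the technical term "two-sample hypothesis testing" used in the field of statistics.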

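In other words, they are experimenting with the product page and collecting statistics such as click-through rate, number of sales, and so on.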
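Because the server may return either variant, it can help to guard each lookup so that a missing label yields `None` instead of an exception. This is only a sketch: the helper name `spec_value` and both HTML fragments are hypothetical and are not taken from the real site.

```python
import re

from bs4 import BeautifulSoup


def spec_value(soup, label, value_tag="dd"):
    """Return the text of the first value_tag after label, or None."""
    # re.escape keeps the label literal even if it contains regex characters.
    label_node = soup.find(string=re.compile(re.escape(label)))
    if label_node is None:
        return None  # this variant of the page has no such label
    value_node = label_node.find_next(value_tag)
    if value_node is None:
        return None  # label present, but no value element follows it
    return value_node.get_text(strip=True)


# Two hypothetical variants of the "same" page.
variant_a = BeautifulSoup(
    "<dl><dt>\tSocket </dt><dd>1150</dd></dl>", "html.parser"
)
variant_b = BeautifulSoup("<div><p>No specifications here</p></div>", "html.parser")

print(spec_value(variant_a, "Socket"))  # 1150
print(spec_value(variant_b, "Socket"))  # None
```

The caller can then decide whether to retry the request, fall back to another selector, or skip the product.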

FYI, here is my current working code:

import re

from bs4 import BeautifulSoup
import requests

# A Session keeps cookies between requests
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
# Request the home page first so the session holds any cookies the site sets
session.get('http://www.computerstore.nl', headers=headers)

response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))

