How to scrape a product's available sizes from the Nike website

Pau*_*aul 1 python web-scraping

I am trying to scrape all available sizes from a Nike product page. For example, this page:

https://www.nike.com/t/air-force-1-07-mens-shoe-JkTGzADv/315122-111

I tried loading the page and writing it to a text file, like this:

import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com/t/air-force-1-07-mens-shoe-JkTGzADv/315122-111"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soupstring = str(soup)

# The with block closes the file automatically, so no explicit close() is needed
with open("website.txt", "w", encoding="utf-8") as f:
    f.write(soupstring)

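But my problem is that the resulting text file does not contain the elements that hold the shoe sizes, so I cannot extract the available sizes from it. I can't think of a way to retrieve them. Does anyone know how to retrieve the available sizes in Python?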

Seb*_*n D 5

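The sizes are filled in after the page loads, so they are not present as elements in the raw HTML that requests downloads; that is one reason you don't see them. The second reason is that, when using requests, you should pass browser-like headers through the headers parameter to get better results.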
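
To make the first point concrete, here is a minimal offline sketch of how state embedded in a script tag can be pulled out of raw HTML. `SAMPLE_HTML` and `extract_state` are hypothetical names, and the sample is a made-up, heavily trimmed stand-in for the real page, which may be structured differently. It uses only the standard library.

```python
import json

# Hypothetical, heavily trimmed stand-in for the downloaded page source.
# Note that the size list lives in a script tag, not in visible elements.
SAMPLE_HTML = """
<html><body>
<div id="sizes"></div>
<script>window.INITIAL_REDUX_STATE={"Threads": {"products": {"315122-111": {"skus": [{"skuId": "118cf6d0-e1c0-50ac-a620-7f3a7f9c0b64", "nikeSize": "12.5"}], "availableSkus": [{"skuId": "118cf6d0-e1c0-50ac-a620-7f3a7f9c0b64", "available": true}]}}}};</script>
</body></html>
"""


def extract_state(html, marker="window.INITIAL_REDUX_STATE="):
    """Return the JSON value assigned to `marker` in the page source."""
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"{marker!r} not found in page source")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here the ';' and the closing tag).
    # It assumes the JSON starts immediately after the marker.
    state, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return state


state = extract_state(SAMPLE_HTML)
product = state["Threads"]["products"]["315122-111"]
print(product["skus"])
print(product["availableSkus"])
```

Parsing with `raw_decode` avoids having to guess how many trailing characters to cut off after the JSON, at the cost of assuming nothing sits between the marker and the opening brace.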

Let's fix this:

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like headers are highly recommended
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
    'Accept': 'image/webp,*/*',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url = "https://www.nike.com/t/air-force-1-07-mens-shoe-JkTGzADv/315122-111"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# The web page is populated with JSON data contained in a script tag, which we look for
script = soup.find('script', string=re.compile('INITIAL_REDUX_STATE'))
# Strip the JavaScript assignment and the trailing ';' so that only the JSON remains
data = json.loads(script.string.replace('window.INITIAL_REDUX_STATE=', '')[0:-1])

# The product we are searching for (the last segment of the URL)
product_id = "315122-111"

# In the JSON data, the following gives us the list of possible SKUs
skus = data['Threads']['products'][product_id]['skus']
# And the following gives their availability
available_skus = data['Threads']['products'][product_id]['availableSkus']

# Let's use pandas to cross both tables
df_skus = pd.DataFrame(skus)
df_available_skus = pd.DataFrame(available_skus)

# Here is finally the table with the available SKUs and their sizes,
# which can be saved in any format you want (xlsx, txt, csv, json...)
df = df_skus.merge(df_available_skus[['skuId', 'available']], on='skuId')
print(df)

Output

|       id |   nikeSize | skuId                                |   localizedSize | localizedSizePrefix   | available   |
|---------:|-----------:|:-------------------------------------|----------------:|:----------------------|:------------|
| 10042654 |       12.5 | 118cf6d0-e1c0-50ac-a620-7f3a7f9c0b64 |            47   | EU                    | True        |
| 10042656 |       14   | 0fb2d87f-a7f8-5e36-8961-99c35b0360c1 |            48.5 | EU                    | True        |
| 10042657 |       15   | f80a30b2-8a7c-5834-82c4-9bea2c0c9995 |            49.5 | EU                    | True        |
| 10042658 |       16   | 3e323cdc-1c35-5663-895e-f3f809edff1e |            50.5 | EU                    | True        |
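
As a follow-up to the remark that the result can be saved in any format, here is a small sketch that lists the available sizes and serializes the rows as CSV using only the standard library. The rows are copied from the output table above (a subset of its columns, with sizes kept as strings); `available_sizes` and `rows_to_csv` are hypothetical helper names, not part of the answer's code.

```python
import csv
import io

# Rows copied from the output table above (subset of columns).
rows = [
    {"nikeSize": "12.5", "localizedSize": "47", "localizedSizePrefix": "EU", "available": True},
    {"nikeSize": "14", "localizedSize": "48.5", "localizedSizePrefix": "EU", "available": True},
    {"nikeSize": "15", "localizedSize": "49.5", "localizedSizePrefix": "EU", "available": True},
    {"nikeSize": "16", "localizedSize": "50.5", "localizedSizePrefix": "EU", "available": True},
]


def available_sizes(rows):
    """Return the nikeSize of every row whose 'available' flag is truthy."""
    return [row["nikeSize"] for row in rows if row.get("available")]


def rows_to_csv(rows):
    """Serialize the rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(available_sizes(rows))
print(rows_to_csv(rows))
```

If the merged result is already a pandas DataFrame, as in the answer, its `to_csv` method produces an equivalent file directly.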