How do I scrape all of the products from a random website?

Caj*_*uu' 9 python lxml web-scraping python-3.x

I'm trying to get all of the products from this website, but somehow I don't think I've chosen the best approach, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on this.

The way I'm doing it right now is this:

  • Go to the site's index page
  • Get all of the categories from there (A-Z, 0-9)
  • Visit each of the categories above and recursively walk through all of its subcategories until I reach a product page
  • When I reach a product page, check whether the product has more SKUs. If it has, get the links. Otherwise, it's the only SKU.

Now, the code below works, but it doesn't get all of the products, and I can't see any reason why it skips some. Maybe the way I'm approaching the whole thing is wrong.

from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session


INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()

    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    for link in page.xpath(
            '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()

Now, the question may be too broad, but I'm wondering whether there's a rule of thumb to follow when someone wants to get all of the data (products, in this case) from a website. Could someone walk me through the whole process of figuring out the best way to approach a scenario like this?

Aja*_*234 5

If your final goal is to scrape the complete product list for each category, it may make sense to target the full product listings for each category from the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over every product page under each category. The final output is a list of namedtuples, one per category name, each holding the current page link and the full product titles for that link:

url = "https://www.richelieu.com/us/en/index"
import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools
s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class':'h1'})][0]:[b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
   for link in links:
      page_data = str(urllib.urlopen(link).read())
      print "link: ", link
      page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
      if not page_links:
         final_page_data = soup(page_data, 'lxml')
         final_titles = [i.text for i in final_page_data.find_all('h3', {'class':'itemHeading'})]
         new_category = category(category1, link, final_titles)
         final_data.append(new_category)

      else:
         page_numbers = set(itertools.chain(*list(map(list, page_links))))

         full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
         for page_result in full_page_links:
            new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
            final_titles = [i.text for i in new_page_data.find_all('h3', {'class':'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)

print final_data

The output will be in the following format:

[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....

To access each attribute, call them like this:

categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
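If you need the results as flat rows rather than namedtuples, here is a minimal sketch that reuses final_data from the script above and writes one row per product to a hypothetical products.csv file:

import csv

# Flatten final_data into one row per product: (category, page link, product title).
# Assumes the script above has already populated final_data.
with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['category', 'page_link', 'product'])
    for entry in final_data:
        for title in entry.products:
            writer.writerow([entry.abbr, entry.link, title])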

I believe the benefit of using BeautifulSoup for this example is the higher-level control it offers and how easily it can be modified. For example, if the OP changes his mind about which aspects of the products/index he wants to scrape, only simple changes to the find_all parameters are needed, since the general structure of the code above is centered around every product category from the index page.
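For instance, a minimal sketch of that kind of change, pulling product links instead of titles out of one listing page's HTML. It reuses the imgWrapper container class that the question's own XPath targets (taken from the question, not re-verified against the live site):

from bs4 import BeautifulSoup as soup

def scrape_product_links(page_data):
    # Same structure as the title scrape above; only the find_all target changes.
    # 'imgWrapper' is the container class used in the question's XPath.
    page = soup(page_data, 'lxml')
    return [
        wrapper.find('a', href=True)['href']
        for wrapper in page.find_all('div', {'class': 'imgWrapper'})
        if wrapper.find('a', href=True)
    ]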