python scrapy web-scraping
I'm trying to scrape product information from Amazon, but I've run into a problem. The spider stops when it reaches the end of the page, and I want to add a way for my program to generically search the next 3 pages of results. I'm trying to edit start_urls, but I can't do that from inside the parse function. Also, it's not a big deal, but for some reason the program requests the same information twice. Thanks in advance.
import scrapy
from scrapy import Spider
from scrapy import Request


class ProductSpider(scrapy.Spider):
    product = input("What product are you looking for? Keywords help for specific products: ")
    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    start_urls = ['https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product]
    # so that websites will not block access to the spider
    download_delay = 30

    def parse(self, response):
        temp_url_list = []
        for i in range(3, 6):
            next_url = response.xpath('//*[@id="pagn"]/span[' + str(i) + ']/a/@href').extract()
            next_url_final = response.urljoin(str(next_url[0]))
            start_urls.append(str(next_url_final))
        # xpath is similar to an address that is used to find certain elements in HTML code,
        # this info is then extracted
        product_title = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract()
        product_price = response.xpath('//span[contains(@class,"s-price")]/text()').extract()
        product_url = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract()
        # yield goes through everything once, saves its spot, does not save info but sends it
        # to the pipeline to get processed if need be
        yield {'product_title': product_title, 'product_price': product_price, 'url': product_url}
        # repeating the same process on concurrent pages
        # it is checking the same url, no generality, need to find, maybe just do like 5 pages,
        # also see if you can have it sort from high to low and find match with certain amount of key words
You're misunderstanding how scrapy works here.
Scrapy expects your spider to generate (yield) scrapy.Request objects or scrapy.Item/dictionary objects. When your spider starts, it takes the urls from start_urls and yields a scrapy.Request for each one:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)
So changing start_urls once the spider has started won't change anything.
However, what you can do is simply yield more scrapy.Requests in your parse() method!
def parse(self, response):
    urls = response.xpath('//a/@href').extract()
    for url in urls:
        yield scrapy.Request(url, self.parse2)

def parse2(self, response):
    # new urls!
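Applied to your spider, a minimal sketch could look like the one below. It reuses the XPaths and pagination range from your question (assuming they still match Amazon's markup), and it swaps the input() call for a spider argument passed on the command line, which is just a side change for illustration. parse() yields the item for the current results page and then follows the pagination links by yielding new Requests back into parse(); Scrapy's built-in duplicate filter drops URLs it has already requested, so the same page is not crawled twice.

import scrapy


class ProductSpider(scrapy.Spider):
    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    # so that websites will not block access to the spider
    download_delay = 30

    def __init__(self, product='laptop', *args, **kwargs):
        # run with: scrapy crawl Product_spider -a product="your keywords"
        super().__init__(*args, **kwargs)
        self.start_urls = [
            'https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product]

    def parse(self, response):
        # scrape the current results page (same XPaths as in the question)
        yield {
            'product_title': response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract(),
            'product_price': response.xpath('//span[contains(@class,"s-price")]/text()').extract(),
            'url': response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract(),
        }
        # follow the next few pagination links instead of editing start_urls;
        # each new page is handled by this same parse() callback
        for i in range(3, 6):
            for href in response.xpath('//*[@id="pagn"]/span[' + str(i) + ']/a/@href').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)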