beb*_*boy 3 python selenium web-crawler scrapy
I want to ask how to crawl this site by clicking the next button (which changes the site's page number) and then keep crawling until the last page.
I tried combining Scrapy with Selenium, but it still errors out with:

line 22
    self.driver = webdriver.Firefox()
    ^
IndentationError: expected an indented block

I don't know why this happens; my code looks fine to me. Can anyone solve this?
Here is my source:
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem

class MySpider(BaseSpider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_urls = ["https://n0where.net/"]

    def parse(self, response):
        for article in response.css('.loop-panel'):
            item = NowItem()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            #item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse2(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()
Syntax and indentation errors aside, there is a more general problem with your code's logic.
What you are doing is creating a webdriver and never using it. What your spider does here is:
it downloads every url in self.start_urls (in your case there is only one), builds a Response object for each and passes it to self.parse(). Your parse2 is never called, so your selenium webdriver is never used.
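For reference, this is roughly what the default start_requests() does when it is not overridden (a simplified sketch, not the exact Scrapy source; the spider name here is made up):

import scrapy

class DefaultFlowSpider(scrapy.Spider):
    name = "default-flow-example"  # hypothetical spider, only to illustrate the default flow
    start_urls = ["https://n0where.net/"]

    def start_requests(self):
        # every url in start_urls becomes a Request; its response is handed
        # to self.parse() by default - nothing ever routes anything to parse2()
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # your parse() runs here for every downloaded page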
Since you are not downloading anything with scrapy in this case, you can override the spider's start_requests() method (<- this is where your spider starts) and put the whole logic there.
Something like this:
from selenium import webdriver
import scrapy
from scrapy import Selector

class MySpider(scrapy.Spider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_url = "https://n0where.net/"

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get(self.start_url)
        while True:
            next_url = driver.find_element_by_xpath(
                '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                # parse the body your webdriver has
                self.parse(driver.page_source)
                # click the button to go to next page
                next_url.click()
            except:
                break
        driver.close()

    def parse(self, body):
        # create Selector from html string
        sel = Selector(text=body)
        # parse it
        for article in sel.css('.loop-panel'):
            item = dict()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            # item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
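A couple of caveats about the sketch above (these notes and the adjustment below are mine, not part of the original answer): self.parse() returns a generator, so calling it without iterating over it produces no items, and find_element_by_xpath() sits outside the try block, so the loop raises NoSuchElementException on the last page instead of breaking. One way to handle both, keeping the old Selenium API used here and writing the items straight to a hypothetical items.jl file instead of going through Scrapy's pipeline, is a drop-in replacement for the start_requests() method above:

import json

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def start_requests(self):
    driver = webdriver.Firefox()
    driver.get(self.start_url)
    with open('items.jl', 'w', encoding='utf-8') as out:  # hypothetical output file
        while True:
            # consume the generator so the items are actually produced
            for item in self.parse(driver.page_source):
                out.write(json.dumps(item) + '\n')
            try:
                next_button = driver.find_element_by_xpath(
                    '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
                next_button.click()  # go to the next page
            except NoSuchElementException:
                break  # no next button on the page -> last page reached
    driver.close()
    return []  # nothing left for scrapy itself to download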