beb*_*boy 3 python selenium web-crawler scrapy
I want to ask how to crawl this site by clicking the next button (which changes the site's page number) and then keep crawling until the last page.
I tried combining Scrapy with Selenium, but it still errors out with:

line 22
    self.driver = webdriver.Firefox()
    ^
IndentationError: expected an indented block

I don't know why this happens; my code looks fine to me. Can anyone solve this?
Here is my source:
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem

class MySpider(BaseSpider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_urls = ["https://n0where.net/"]

    def parse(self, response):
        for article in response.css('.loop-panel'):
            item = NowItem()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            #item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse2(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()
Syntax and indentation errors aside, there is a more general problem with your code's logic.
What you are doing is creating a webdriver and never using it. What your spider does here is:
it downloads every url in self.start_urls (in your case there is only one), builds a Response object for each and passes it to self.parse(). Your parse2 is never called, so your selenium webdriver is never used.
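For reference, this is roughly what the default start_requests() does when it is not overridden (a simplified sketch, not the exact Scrapy source; the spider name here is made up):

import scrapy

class DefaultFlowSpider(scrapy.Spider):
    name = "default-flow-example"  # hypothetical spider, only to illustrate the default flow
    start_urls = ["https://n0where.net/"]

    def start_requests(self):
        # every url in start_urls becomes a Request; its response is handed
        # to self.parse() by default - nothing ever routes anything to parse2()
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # your parse() runs here for every downloaded page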
Since you are not downloading anything with scrapy in this case, you can override the spider's start_requests() method (<- this is where your spider starts) and put the whole logic there.
Something like this:
from selenium import webdriver
import scrapy
from scrapy import Selector

class MySpider(scrapy.Spider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_url = "https://n0where.net/"

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get(self.start_url)
        while True:
            next_url = driver.find_element_by_xpath(
                '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                # parse the body your webdriver has
                self.parse(driver.page_source)
                # click the button to go to next page
                next_url.click()
            except:
                break
        driver.close()

    def parse(self, body):
        # create Selector from html string
        sel = Selector(text=body)
        # parse it
        for article in sel.css('.loop-panel'):
            item = dict()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            # item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
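A couple of caveats about the sketch above (these notes and the adjustment below are mine, not part of the original answer): self.parse() returns a generator, so calling it without iterating over it produces no items, and find_element_by_xpath() sits outside the try block, so the loop raises NoSuchElementException on the last page instead of breaking. One way to handle both, keeping the old Selenium API used here and writing the items straight to a hypothetical items.jl file instead of going through Scrapy's pipeline, is a drop-in replacement for the start_requests() method above:

import json

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def start_requests(self):
    driver = webdriver.Firefox()
    driver.get(self.start_url)
    with open('items.jl', 'w', encoding='utf-8') as out:  # hypothetical output file
        while True:
            # consume the generator so the items are actually produced
            for item in self.parse(driver.page_source):
                out.write(json.dumps(item) + '\n')
            try:
                next_button = driver.find_element_by_xpath(
                    '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
                next_button.click()  # go to the next page
            except NoSuchElementException:
                break  # no next button on the page -> last page reached
    driver.close()
    return []  # nothing left for scrapy itself to download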