Unable to make my script process a locally created server response the right way

Question

rob*_*txt 6 python scrapy web-scraping flask python-3.x

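I've used a script to run Selenium locally so that I can make use of the response (derived from Selenium) within my spider.

This is the web service in which Selenium runs locally: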

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)

This is my Scrapy spider, which uses that response to parse the titles from the web pages.

import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping/?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link,callback=self.parse)

    def parse(self, response):
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link,callback=self.parse_info,dont_filter=True)

    def parse_info(self, response):
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title":item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()
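
To make the hand-off between the spider and the local service concrete, here is a minimal sketch of how a target URL is packed into the `url` query parameter with `quote` and what the service gets back once the parameter is decoded. The target address is a placeholder chosen for illustration, not the one used above, and the decoding is done with `urllib.parse` here rather than by Flask.

```python
from urllib.parse import quote, urlsplit, parse_qs

# Placeholder target, for illustration only.
target = 'https://example.com/questions?sort=newest&pageSize=50'

# Same construction as in start_requests(): the whole target URL is
# percent-encoded so its own '?' and '&' do not leak into the outer query.
link = 'http://127.0.0.1:5000/?url={}'.format(quote(target))

# Decoding the query string recovers the target as a single 'url' value,
# which is what the service would pass on to driver.get().
recovered = parse_qs(urlsplit(link).query)['url'][0]

print(link)
print(recovered)
```

Without `quote`, `pageSize=50` would be read as a second parameter of the local request and the service would see a truncated URL.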

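The problem is that the above script gives me the same title several times, then another title several times, and so on.

What change should I make so that my script works the right way?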

Answer 1

ASH*_*Hu2 4

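I ran both scripts, and they run as intended. Here are my findings:

downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError: there is no way to get past this error without the server's permission, and the server here is eBay.
Log from Scrapy: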

2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'downloader/request_bytes': 55523,
 'downloader/request_count': 81,
 'downloader/request_method_count/GET': 81,
 'downloader/response_bytes': 2448476,
 'downloader/response_count': 9,
 'downloader/response_status_count/200': 9,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
 'item_scraped_count': 8,
 'log_count/DEBUG': 90,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 9,
 'retry/count': 72,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'scheduler/dequeued': 81,
 'scheduler/dequeued/memory': 81,
 'scheduler/enqueued': 131,
 'scheduler/enqueued/memory': 131,
 'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)

You can see that only 8 items were scraped. These are just logos and other unrestricted things.

Server Log:

s:// .ebaystatic.com http:// .ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution.", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)

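eBay does not allow you to scrape its site.

So how do you complete your task?

Always check the /robots.txt of a site before scraping it. For eBay it is http://www.ebay.com/robots.txt, and you can see that almost everything is disallowed.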

User-agent: *
Disallow: /*rt=nc
Disallow: /b/ LH_
Disallow: /brw/
Disallow: /clp/
Disallow: /clt/store/
Disallow: /csc/
Disallow: /ctg/
Disallow: /ctm/
Disallow: /dsc/
Disallow: /edc/
Disallow: /feed/
Disallow: /gsr/
Disallow: /gwc/
Disallow: /hcp/
Disallow: /itc/
Disallow: /lit/
Disallow: /lst/ng/
Disallow: /lvx/
Disallow: /mbf/
Disallow: /mla/
Disallow: /mlt/
Disallow: /myb/
Disallow: /mys/
Disallow: /prp/
Disallow: /rcm/
Disallow: /sch/ % 7C
Disallow: /sch/ LH_
Disallow: /sch/aop/
Disallow: /sch/ctg/
Disallow: /sl/node
Disallow: /sme/
Disallow: /soc/
Disallow: /talk/
Disallow: /tickets/
Disallow: /today/
Disallow: /trylater/
Disallow: /urw/write-review/
Disallow: /vsp/
Disallow: /ws/
Disallow: /sch/ modules=SEARCH_REFINMENTS_MODEL_V2
Disallow: /b/ modules=SEARCH_REFINMENTS_MODEL_V2
Disallow: /itm/ _nkw
Disallow: /itm/ ?fits
Disallow: /itm/ &fits
Disallow: /cta/

So go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their documentation. Their site has simpler sample code for getting the items you need without scraping.
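
The robots.txt check can also be done in code before a URL is requested. Below is a minimal sketch using the standard library's `urllib.robotparser`. It parses a short inline excerpt (three rules copied from the eBay listing, not the full file) so that it runs without network access. The helper names `build_parser` and `is_allowed`, the agent name `my-spider`, and the second test URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt only: three of the rules quoted from eBay's robots.txt.
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /clp/
Disallow: /feed/
Disallow: /ws/
"""

def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="my-spider"):
    """True if the rules permit user_agent to fetch url (hypothetical helper)."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(ROBOTS_EXCERPT)

# Falls under "Disallow: /ws/".
blocked = is_allowed(parser, "https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll")
# Hypothetical path that none of the three excerpted rules cover.
permitted = is_allowed(parser, "https://www.ebay.com/help/home")

print(blocked, permitted)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of an excerpt. Within Scrapy itself, the `ROBOTSTXT_OBEY` setting makes the crawler apply these rules automatically.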
