I am trying to pass user-defined arguments to a Scrapy spider. Can anyone suggest how to do this?
I read somewhere about a -a parameter, but have no idea how to use it.
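For reference, the -a flag passes key=value pairs straight into the spider's constructor as keyword arguments. A minimal sketch, assuming a hypothetical spider named myspider and a made-up category argument:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Every -a key=value given on the command line arrives
        # here as a keyword argument.
        self.category = category

    def start_requests(self):
        # The site URL below is a placeholder, only illustrating use of the argument.
        yield scrapy.Request('http://example.com/%s' % self.category)

Invoked as: scrapy crawl myspider -a category=electronics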
I am new to Scrapy and I am looking for a way to run it from a Python script. I found two sources explaining this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the …
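For what it's worth, newer Scrapy releases expose a documented entry point for running a crawl from a script; a minimal sketch using CrawlerProcess, assuming a project spider registered under the name 'example' (a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Reads the project settings, so the spider can be referenced by name.
process = CrawlerProcess(get_project_settings())
process.crawl('example')  # spider name or spider class
process.start()  # blocks until the crawl is finished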
I want to use Scrapy to crawl a web page. Is there a way to pass the starting URL from the terminal itself?
The documentation says a spider can be given either a name or a URL, but when I give a URL it throws an error:
// My spider's name is example, but instead of the spider name I am giving a URL (it works fine if I give the spider name).
scrapy crawl example.com
Error:
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: example.com'
How can I make Scrapy use my spider on a URL given at the terminal?
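One common pattern (a sketch, not the only way) is to keep the spider name fixed and hand the URL over with -a, assuming a spider named example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self, start_url=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # -a start_url=... arrives here as a keyword argument.
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        self.log('Visited %s' % response.url)

Invoked as: scrapy crawl example -a start_url=http://example.com/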
There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them, because my browsing history got deleted unexpectedly.
All of the above questions couldn't help me. Either they used Celery or Scrapyd, while I want to use the multiprocessing library. Also, the official Scrapy documentation shows how to run multiple spiders in a SINGLE process, not in MULTIPLE processes.
None of them could help …
python scrapy web-scraping scrapy-spider python-multiprocessing
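A minimal sketch of the multiprocessing route, assuming two project spiders registered as spider_one and spider_two (placeholder names); each child process gets its own Twisted reactor, which sidesteps the "reactor not restartable" problem:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider(spider_name):
    # A fresh CrawlerProcess per OS process; each owns its own reactor.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()  # blocks this child until its crawl finishes

if __name__ == '__main__':
    procs = [Process(target=run_spider, args=(name,))
             for name in ('spider_one', 'spider_two')]
    for p in procs:
        p.start()
    for p in procs:
        p.join()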
I can run a crawl in a Python script using the following recipe from the wiki:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I can pass a domain to FollowAllSpider, but my question is: how can I use the code above to pass start_urls (actually, a date that will be appended to a fixed URL) to my spider class?
Here is my spider class:
class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test; it could split with a regex!
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')
        self.allowed_domains = ['mydomin.com']
        self.start_urls …
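Since the recipe above constructs the spider instance itself before handing it to crawler.crawl(spider), the date can simply be passed to the constructor, which then builds start_urls. A sketch under that assumption (the URL pattern below is a placeholder):

from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['mydomin.com']

    def __init__(self, date, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        y, m, d = map(int, date.split('-'))
        # Append the date to the fixed URL (placeholder pattern).
        self.start_urls = ['http://www.mydomin.com/%04d-%02d-%02d' % (y, m, d)]

spider = MySpider(date='2014-01-01')  # then hand it to crawler.crawl(spider)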