I am a new learner of Scrapy. I installed Python 2.7 and all the other required dependencies.
Then I tried to build a Scrapy project following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html.
At the crawling step, after I typed scrapy crawl dmoz, it produced this error message:
ImportError: No module named win32api
[twisted] CRITICAL: Unhandled error in Deferred
I am using Windows.
Stack trace:
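A common fix, offered here as an assumption based on the missing win32api module rather than anything stated in the question: install the pywin32 bindings that Twisted relies on under Windows, then re-run the crawl:

pip install pypiwin32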
I want to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
My script is as follows:
import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider) ## <-------------- (1)
process.start()
I found that process.crawl() in (1) creates another LinkedInAnonymousSpider whose first and last are None (printed at (2)). If that is the case, there is no point in creating the spider object; how is it possible to pass the arguments first and last to process.crawl()?
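For reference, CrawlerProcess.crawl() accepts a spider class together with the constructor arguments, so a minimal sketch of the intended call (assuming the spider's __init__ shown just below) would be:

process = CrawlerProcess(get_project_settings())
# pass the class, not an instance; the extra arguments are forwarded to the spider's __init__
process.crawl(LinkedInAnonymousSpider, None, "James", "Bond")
process.start()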
linkedin_anonymous:
from logging import INFO
import scrapy
class LinkedInAnonymousSpider(scrapy.Spider):
name = "linkedin_anonymous"
allowed_domains = ["linkedin.com"]
start_urls = []
base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"
def __init__(self, input = None, first= None, last=None):
self.input = input # source file name
self.first = …Run Code Online (Sandbox Code Playgroud) 我scrapy在python脚本中运行
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()  # assumed helper: stop Twisted once the spider has closed

def setup_crawler(domain):
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = ArgosSpider(domain=domain)  # ArgosSpider is defined elsewhere in the project
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()
It runs successfully and stops, but where is the result? I want the result in JSON format; how can I do that?
result = responseInJSON
just like when we use the command
scrapy crawl argos -o result.json -t json
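One way to get the same effect from the script, sketched here under the assumption of a Scrapy release where Settings.set() is available and the FEED_URI/FEED_FORMAT settings are honoured, is to set the feed-export settings before building the crawler:

settings = get_project_settings()
# mirrors "-o result.json -t json": the built-in feed exporter writes the scraped items to disk
settings.set('FEED_FORMAT', 'json')
settings.set('FEED_URI', 'result.json')
crawler = Crawler(settings)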
I have successfully tried exporting my items into a csv file from the command line, like:
scrapy crawl spiderName -o filename.csv
My question is: what is the simplest solution to do the same from code? I need this because I extract the file name from another file. The final scenario should be that I call
scrapy crawl spiderName
and it writes the items to filename.csv.
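A minimal sketch of doing this from code, assuming Scrapy's CrawlerProcess and the standard feed-export settings (the file name and MySpider are placeholders, not names from the question):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

filename = 'filename.csv'  # e.g. read from the other file

settings = get_project_settings()
settings.set('FEED_FORMAT', 'csv')
settings.set('FEED_URI', filename)

process = CrawlerProcess(settings)
process.crawl(MySpider)  # placeholder for the spider class behind "spiderName"
process.start()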
I have a question. I need to pause the execution of a function for a while without stopping the whole parse. In other words, I need a non-blocking pause.
It looks like this:
from scrapy import Spider, Request

class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Here I need some function for sleep only this function like time.sleep(10)

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
The function non_stop_function needs to pause for a while, but it should not block the rest of the output.
If I insert time.sleep(), it stops the whole parser, which is not what I need. Is it possible to pause just this one function, using twisted or something else?
Reason: I need to create a non-blocking function that parses a website page every n seconds. There it will collect URLs and keep filling them in for 10 seconds. URLs that have already been obtained will continue to be processed, but the main function needs to sleep.
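For reference, the non-blocking way to wait under Twisted is to schedule work on the reactor instead of sleeping; a minimal, Scrapy-agnostic sketch of such a delay helper (an illustration only, not the solution adopted in the update below):

from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(seconds):
    # returns a Deferred that fires after `seconds` without blocking the reactor
    return deferLater(reactor, seconds, lambda: None)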
Update:
Thanks to TkTech and viach. One answer helped me understand how to create a deferred Request, and the second showed how to activate it. The two answers complement each other, and I ended up with an excellent non-blocking pause for Scrapy:
def call_after_pause(self, response):
    d = Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'https://example.com/',
        callback=self.non_stop_function, …

This is my first question on Stack Overflow. Recently I wanted to use the Linked-in-scraper, so I downloaded it and ran "scrapy crawl linkedin.com", and got the error message below. For your information, I am using Anaconda 2.3.0 and Python 2.7.11. All the related packages (including scrapy and six) were updated by pip before executing the program.
Traceback (most recent call last):
  File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/cmdline.py", line 108, in execute
    settings = get_project_settings()
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/utils/project.py", line 60, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 285, in setmodule
    self.set(key, getattr(module, key), priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 260, in set
    self.attributes[name].set(value, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 55, in set
    value = BaseSettings(value, priority=priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 91, in __init__
    self.update(values, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 317, in update
    for name, …

I am new to scrapy and I am trying to scrape the webpages of the IKEA website. The base page, which contains the list of locations, is given here.
My items.py file is as follows:
import scrapy
class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem
class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the file, I get no output. The json file output is something like:
[[{"link": [], "name": []}
The output I am looking for is the name of the location and the link. I am getting nothing. Where am I going wrong?
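For reference, one likely cause, offered as an observation rather than a fix verified against the live site: the selector in the loop already points at each <a> element, so the inner XPath expressions should be relative to that element instead of looking for a nested a, and allowed_domains expects bare domain names. A sketch of the adjusted parse method:

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('text()').extract()  # relative to the already-selected <a>
            item['link'] = sel.xpath('@href').extract()
            yield item

with allowed_domains = ['ikea.com'] on the spider.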
I am crawling 23770 webpages with a very simple web scraper using scrapy. I am quite new to scrapy and even to Python, but I managed to write a spider that does the job. It is, however, really slow (it takes about 28 hours to crawl the 23770 pages).
I have looked at the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic advice on writing fast crawlers that is understandable for a beginner. Maybe my problem is not the spider itself, but the way I run it. All suggestions are welcome!
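For context, crawl throughput in Scrapy is mostly governed by a handful of settings; a hedged sketch of the knobs usually tuned in settings.py (the values are purely illustrative, not a recommendation for this particular site):

# settings.py - illustrative values only
CONCURRENT_REQUESTS = 32             # requests handled in parallel overall
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # parallel requests against one domain
DOWNLOAD_DELAY = 0                   # no artificial pause between requests
LOG_LEVEL = 'INFO'                   # DEBUG logging of every request/item slows long crawls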
I have listed my code below, in case it is needed.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re
class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response) …

I have to call the crawler from another Python file, for which I use the following code.
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

def crawl_koovs():
    spider = SomeSpider()  # SomeSpider is defined elsewhere
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
When I run it, I get the error
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use, because I want to call this method multiple times and I want the reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?
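For reference, later Scrapy releases document a CrawlerRunner-based pattern for running several crawls from a script inside a single reactor run, since a Twisted reactor cannot be restarted once stopped. A minimal sketch, assuming the SomeSpider class from the snippet above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # chain the crawls sequentially inside one reactor run
    yield runner.crawl(SomeSpider)
    yield runner.crawl(SomeSpider)
    reactor.stop()

crawl()
reactor.run()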
I want to get all the external links from a given website using Scrapy. With the following code the spider crawls the external links as well:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem
class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = someItem()
        item['url'] = response.url
        return item
What am I missing? Does "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" for the LinkExtractor, it does not extract the external links. Just to clarify: I do not want to crawl the internal links, but extract the external ones. Any help appreciated!
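For reference, a sketch of one common approach, offered as an assumption rather than a verified answer: keep the crawl restricted to the internal domain via the rule, and run a second LinkExtractor over each fetched page to pull out the off-site links. The imports and domain names are the placeholders from the question:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    # follow internal links only
    rules = (Rule(LinkExtractor(allow_domains=['someurl.com']), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # ...and extract the external links found on each internal page
        for link in LinkExtractor(deny_domains=['someurl.com']).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item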