After trying the first Scrapy tutorial I was really excited, so I wanted to try a form submission as well.
I have the following script. If I print out response.body I get back the page containing the form, but nothing else happens. Can anyone help me figure out how to get to the results page?
# spiders/holidaytaxi.py
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector, Selector


class HolidaytaxiSpider(scrapy.Spider):
    name = "holidaytaxi"
    allowed_domains = ["holidaytaxis.com"]
    start_urls = ['http://holidaytaxis.com/en']

    def parse(self, response):
        return [FormRequest.from_response(
            response,
            formdata={
                'bookingtypeid': 'Return',
                'airpotzgroupid_chosen': 'Turkey',
                'pickup_chosen': 'Antalya Airport',
                'dropoff_chosen': 'Alanya',
                'arrivaldata': '12-07-2015',
                'arrivalhour': '12',
                'arrivalmin': '00',
                'departuredata': '14-07-2015',
                'departurehour': '12',
                'departuremin': '00',
                'adults': '2',
                'children': '0',
                'infants': '0'
            },
            callback=self.parseResponse
        )]

    def parseResponse(self, response):
        print "Hello World"
        print response.status
        print response
        heading = response.xpath('//div/h2')
        print "heading: ", heading
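As an aside, when from_response seems to submit nothing useful, it usually helps to check in the scrapy shell which form it actually picked and what it would post; the field names ending in _chosen in particular look like the ids of JavaScript "Chosen" widgets rather than real form field names, which is worth verifying. A minimal sketch for the shell (formnumber=0 and the inspected attributes are assumptions, not verified against the live site):

# inside:  scrapy shell 'http://holidaytaxis.com/en'
from scrapy.http import FormRequest

# list the forms and their real input/select names
for i, form in enumerate(response.xpath('//form')):
    print i, form.xpath('@action').extract_first()
    print '  ', form.xpath('.//input/@name | .//select/@name').extract()

# build the request against an explicit form and inspect what would be posted
req = FormRequest.from_response(response, formnumber=0,
                                formdata={'bookingtypeid': 'Return'})
print req.url
print req.body
fetch(req)          # then look at response.status and response.body in the shell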
The output of the spider run is:
2015-07-05 16:23:59 [scrapy] DEBUG: Telnet console listening on …
When I try to crawl a certain website (with both a spider and the scrapy shell), I get the following error:
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
I found out that this happens when no user agent is set. But after setting it manually, I still get the same error.
You can see the whole output of the scrapy shell here: http://pastebin.com/ZFJZ2UXe
I am not behind a proxy, and I can reach other sites through the scrapy shell without problems. I can also reach the site with Chrome, so it is not a network or connectivity issue.
Maybe someone can give me a hint as to how I can solve this problem?
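For reference, a minimal sketch of how a browser-like User-Agent is usually forced for every request, assuming the block really is header-based (the UA string is just an example, not a recommendation):

# settings.py (or per-spider via custom_settings)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

In the shell the same setting can be passed on the command line, e.g. scrapy shell -s USER_AGENT='Mozilla/5.0 ...' 'http://example.com'. If the error persists even with full browser headers, the site may be rejecting non-browser clients at the TLS/connection level, which header changes alone will not fix.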
I am trying to scrape Booking.com. The spider opens and closes without crawling the URLs (output screenshot: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. This is the code I have written so far. Please point out what I am doing wrong.
import scrapy
import urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.loader import ItemLoader
from CinemaScraper.items import CinemascraperItem


class trip(CrawlSpider):
    name = "tripadvisor"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        return scrapy.Request(url, callback=self.parse_reviews)

    def parse_reviews(self, response):
        for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item = CinemascraperItem()
            #sometimes …
I am scraping listings with Scrapy. My script first parses the listing URLs with parse_node, then parses each listing with parse_listing, and for each listing it parses the listing's agents with parse_agent. I would like to create an array that gets built up as Scrapy parses its way through the listing and the listing's agents, and that is reset for each new listing.
Here is my parsing script:
def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
    yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    for agent in string.split(response.xpath('//node[@id="Agents"]/text()').extract_first() or "", '^'):
        yield Request('AGENT LINK', callback=self.parse_agent)

def parse_agent(self, response):
    yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
    yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
I would like parse_listing to result in:
{
    'id': 123,
    'title': 'Amazing Listing'
}
and then parse_agent to add to an agents array on the listing:
{
    'id': 123,
    'title': 'Amazing Listing',
    'agent': [
        {
            'name': 'jon doe',
            'email': 'jon.doe@email.com'
        },
        {
            'name': 'jane doe',
            'email': 'jane.doe@email.com'
        }
    ]
}
How do I take the results from each level and build up an array like this?
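A sketch of one common way to do this: carry the partially built listing dict through Request.meta and only yield it once the last agent response has been handled. The AGENT LINK placeholder from the question is assumed to split into real URLs, and the node XPaths are copied from the question:

def parse_listing(self, response):
    listing = {
        'id': response.xpath('//node[@id="ListingId"]/text()').extract_first(),
        'title': response.xpath('//node[@id="ListingTitle"]/text()').extract_first(),
        'agent': [],
    }
    agents_raw = response.xpath('//node[@id="Agents"]/text()').extract_first() or ""
    agent_urls = [u for u in agents_raw.split('^') if u]
    if not agent_urls:
        yield listing                      # no agents: the listing is already complete
        return
    # request the agents one at a time, passing the listing along
    yield Request(agent_urls[0], callback=self.parse_agent,
                  meta={'listing': listing, 'remaining': agent_urls[1:]})

def parse_agent(self, response):
    listing = response.meta['listing']
    remaining = response.meta['remaining']
    listing['agent'].append({
        'name': response.xpath('//node[@id="AgentName"]/text()').extract_first(),
        'email': response.xpath('//node[@id="AgentEmail"]/text()').extract_first(),
    })
    if remaining:                          # more agents to fetch for this listing
        yield Request(remaining[0], callback=self.parse_agent,
                      meta={'listing': listing, 'remaining': remaining[1:]})
    else:                                  # last agent handled: emit the finished listing
        yield listing

The reset-per-listing behaviour falls out naturally because each call to parse_listing builds a fresh dict.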
Is it possible to configure a Scrapy spider to ignore the URL parameters of visited URLs, so that www.example.com/page?p=value2 is not visited if www.example.com/page?p=value1 has already been visited?
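One way this is often done is with a custom duplicate filter that strips the query string before fingerprinting. A hedged sketch, assuming a Scrapy version where RFPDupeFilter.request_fingerprint is still the hook (newer releases moved fingerprinting elsewhere) and a project layout where the class lives in myproject/dupefilters.py:

# myproject/dupefilters.py
from w3lib.url import url_query_cleaner
from scrapy.dupefilters import RFPDupeFilter


class QueryStrippingDupeFilter(RFPDupeFilter):
    """Treat URLs that differ only in their query string as duplicates."""

    def request_fingerprint(self, request):
        # url_query_cleaner with no whitelist drops every ?param=value pair,
        # so /page?p=value1 and /page?p=value2 fingerprint identically.
        stripped = request.replace(url=url_query_cleaner(request.url))
        return super(QueryStrippingDupeFilter, self).request_fingerprint(stripped)

It is enabled with DUPEFILTER_CLASS = 'myproject.dupefilters.QueryStrippingDupeFilter' in settings.py (project name assumed). Note that this makes the crawler skip the second URL entirely; if both pages must be fetched but treated as one item, the dupefilter is the wrong layer.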
I have a crawl spider that works fine:
# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html',
                  'http://www.exampleregelwiki.de/index.php/categoryB.html',
                  'http://www.exampleregelwiki.de/index.php/categoryC.html',]

    #"Titel": :
    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }
Running it with scrapy runspider spider.py -o dat.json saves all the information into dat.json.
I would like one output file per start URL: categoryA.json, categoryB.json, and so on.
A similar question was left without a working answer: I could not reproduce that answer, and I could not learn anything from the suggestions there either.
How can I achieve the goal of having several output files, one per start URL? I only want to run a single command/shell script/file to do this.
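A sketch of one way to get there with a single scrapy command: tag every item with the category of the start URL it came from, and let a small item pipeline keep one JSON exporter per category. The module path used in ITEM_PIPELINES and the way the category is derived from the URL are assumptions:

# pipelines.py -- write items into <category>.json, one exporter per category
from scrapy.exporters import JsonItemExporter


class PerCategoryJsonPipeline(object):
    def open_spider(self, spider):
        self.exporters = {}                      # category -> (file, exporter)

    def close_spider(self, spider):
        for f, exporter in self.exporters.values():
            exporter.finish_exporting()
            f.close()

    def process_item(self, item, spider):
        category = item.pop('category', 'other')
        if category not in self.exporters:
            f = open('%s.json' % category, 'wb')
            exporter = JsonItemExporter(f)
            exporter.start_exporting()
            self.exporters[category] = (f, exporter)
        self.exporters[category][1].export_item(item)
        return item

On the spider side, parse would pass meta={'category': ...} (e.g. derived from response.url) along to parse_details, and parse_details would add that 'category' field to the yielded dict; the pipeline pops it and routes the item. Enable it with ITEM_PIPELINES = {'myproject.pipelines.PerCategoryJsonPipeline': 300} (project name assumed) and run the spider as a project (scrapy crawl vieles) rather than with -o, since -o always writes a single feed.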
I am getting started with Scrapy. I already extract what I want from one sports page (the names and teams of football players), but I need to follow links to reach more teams; every team page has a link to its players page. The site's link structure is:
Team page: http://esporte.uol.com.br/futebol/clubes/vitoria/
Players page: http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/
I have read some Scrapy tutorials, and my thinking is that on team pages I should follow the links and not parse anything, while on players pages I should not follow further and should parse the players. I do not know whether this idea is right or whether my syntax is wrong; if my idea about following is wrong, any help is welcome.
Here is my code:
class MoneyballSpider(BaseSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br", "click.uol.com.br", "uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/', ), deny=(r'.*futebol/clubes/.*/jogadores/', )), follow=True),
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/jogadores/', )), callback='parse', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        jogadores = hxs.select('//div[@id="jogadores"]/div/ul/li')
        items = []
        for jogador in jogadores:
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            items.append(item)
            print item['nome'], item['time']
        return items
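For what it's worth, the rules attribute is only honoured by CrawlSpider, not BaseSpider, and a CrawlSpider must not override parse, because CrawlSpider uses parse internally to apply the rules. A hedged sketch of the usual shape, with the callback renamed and the selectors copied from the question (the items import path, the start URL and the tightened regexes are assumptions):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from myproject.items import JogadorItem      # adjust to the real items module


class MoneyballSpider(CrawlSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol/clubes/"]

    rules = (
        # team pages: just follow, parse nothing
        Rule(SgmlLinkExtractor(allow=(r'futebol/clubes/[^/]+/$', )), follow=True),
        # players pages: parse the players, no further following needed
        Rule(SgmlLinkExtractor(allow=(r'futebol/clubes/[^/]+/jogadores/', )),
             callback='parse_jogadores'),
    )

    def parse_jogadores(self, response):
        hxs = HtmlXPathSelector(response)
        for jogador in hxs.select('//div[@id="jogadores"]/div/ul/li'):
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            yield item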
I have seen similar questions and tried the answers, but to no avail. Could someone help me correct this?
Thanks
hepz@ubuntu:~/Documents/project/project$ scrapy crawl NewsSpider.py
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 58, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 44, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: NewsSpider.py'
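For context, scrapy crawl expects the value of the spider's name attribute, not the file name, so passing NewsSpider.py raises exactly this KeyError. The actual name in the asker's spider is not shown, so the one below is purely illustrative:

# project/spiders/news_spider.py  -- the file name is irrelevant to "scrapy crawl"
import scrapy


class NewsSpider(scrapy.Spider):
    name = "news"                        # run with:  scrapy crawl news
    start_urls = ["http://example.com"]

    def parse(self, response):
        pass

The command also has to be run from inside the project directory so that the spider module is discovered.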
I can get the list of spans I want with the following:
attrs = soup.find_all("span")
This returns a list of spans, acting as keys and values:
[
<span>back camera resolution</span>,
<span class="even">12 MP</span>
]
[
<span>front camera resolution</span>,
<span class="even">16 MP</span>
]
[
<span>video resolution</span>,
<span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
</span>
]
The original HTML is:
Why is the "video resolution" value converted like this?
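That span is being rewritten by Cloudflare's e-mail address obfuscation: anything on the page that looks like an e-mail address is replaced with an [email protected] placeholder carrying a data-cfemail attribute, plus a script that decodes it in the browser. Decoding the attributes above suggests the hidden values are resolutions such as 2160p@30fps and 1080p@30fps, i.e. the @...fps notation is being mistaken for addresses; since BeautifulSoup does not run JavaScript, you only see the raw placeholders. A hedged sketch of decoding the attribute offline (the XOR-with-first-byte scheme is the commonly documented one):

def decode_cfemail(cfemail):
    """Decode Cloudflare's data-cfemail value: the first byte is an XOR key for the rest."""
    key = int(cfemail[:2], 16)
    return ''.join(
        chr(int(cfemail[i:i + 2], 16) ^ key)
        for i in range(2, len(cfemail), 2)
    )

# replace the placeholders in the soup with the decoded text
for a in soup.select('a.__cf_email__'):
    a.replace_with(decode_cfemail(a['data-cfemail']))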
When I run my script, Scrapy keeps throwing this message:
Took longer than 180.0 seconds
What causes this problem, and what is a concrete solution for it?
Below is a screenshot of this message.

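180 seconds is Scrapy's default DOWNLOAD_TIMEOUT, so the message means that individual requests are timing out rather than failing outright; the retry middleware then re-schedules them, which is why it keeps appearing. A sketch of the settings usually adjusted for this (the values are illustrative, not recommendations):

# settings.py -- illustrative values only
DOWNLOAD_TIMEOUT = 60        # fail slow requests sooner than the 180 s default
RETRY_ENABLED = True
RETRY_TIMES = 2              # how often a timed-out request is retried
CONCURRENT_REQUESTS = 8      # fewer parallel requests can ease server/connection pressure

If the timeouts only ever hit a handful of URLs, it is worth checking whether those specific pages are very large or whether the target server is throttling the crawler.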