I'm new to Scrapy and I'm trying to "add the contents of a web page" to a response object (if I understand it correctly).
I'm following http://doc.scrapy.org/en/latest/topics/selectors.html, but it uses the scrapy shell. I want to do this directly in Python code.
I wrote this code to scrape http://doc.scrapy.org/en/latest/_static/selectors-sample1.html:
import scrapy
from scrapy.http import HtmlResponse
URL = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'
response = HtmlResponse(url=URL)
print response.selector.xpath('//title/text()')
And the output is:
>> []
Why can't I get the correct title value? It seems that HtmlResponse() doesn't download any data from the web... why? What should I do?
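For what it's worth, here is a minimal sketch of what I suspect is needed - fetching the HTML myself and passing it in as the body, since HtmlResponse apparently doesn't download anything on its own (I'm assuming the requests library is available; this is a guess, not a confirmed fix):

import requests
from scrapy.http import HtmlResponse

URL = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

# HtmlResponse does not download anything by itself, so fetch the page first
html = requests.get(URL).text

# pass the downloaded HTML as the body, with an explicit encoding
response = HtmlResponse(url=URL, body=html, encoding='utf-8')
print response.selector.xpath('//title/text()').extract()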
Thanks very much!
Cap
I get an error when I try to install Scrapy on OS X from the terminal.
The command I used:
sudo pip install -U scrapy
The error I get:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/commands/install.py", line 317, in run
prefix=options.prefix_path,
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/req/req_set.py", line 736, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/req/req_install.py", line 742, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-8.1.2-py2.7.egg/pip/utils/__init__.py", line 267, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 299, in move
copytree(src, real_dst, symlinks=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 208, in copytree
    raise Error, errors …

I'm using Python Selenium together with Scrapy to crawl a website.
But my script is very slow:
Crawled 1 pages (at 1 pages/min)
I use CSS selectors instead of XPath to save time, and I changed the middleware:
'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,
Is Selenium just too slow, or should I change something in the settings?
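For reference, these are the Scrapy settings I have been thinking of tuning, although I suspect the real bottleneck is that every request goes through a single blocking Selenium browser (the values below are placeholders taken from the docs, not a known fix):

# settings.py - values are placeholders I am experimenting with
CONCURRENT_REQUESTS = 16              # how many requests Scrapy handles in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0                    # no artificial delay between requests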
My code:
def start_requests(self):
    yield Request(self.start_urls, callback=self.parse)

def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    #INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1,len(catNums)+1):
        print "\n IN catnumber \n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
        if(pages):
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')
            checkText = …

When I try to construct an HtmlResponse object in Scrapy like this:
scrapy.http.HtmlResponse(url=self.base_url + dealer_url[0], body=dealer_html)
I get this error:
Traceback (most recent call last):
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\fwi.py", line 69, in parse_items
dealer_page = scrapy.http.HtmlResponse(url=self.base_url + dealer_url[0], body=dealer_html)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\text.py", line 27, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\__init__.py", line 18, in __init__
self._set_body(body)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\text.py", line 43, in _set_body
type(self).__name__)
TypeError: Cannot convert unicode body - HtmlResponse has no encoding
Does anyone know how to fix this error?
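My best guess so far (untested) is that a unicode body needs an explicit encoding, so I'm wondering whether something like this would work:

# guessing that an explicit encoding lets HtmlResponse accept a unicode body
dealer_page = scrapy.http.HtmlResponse(
    url=self.base_url + dealer_url[0],
    body=dealer_html,
    encoding='utf-8',
)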
I can't find an answer to this question: how do I execute Python code after a scrapy spider has exited?
In the function that parses the responses (def parse_item(self, response):) I call self.my_function(), where my_function() is something I defined, but the problem is that it still runs inside the spider's loop. My main idea is to run the given code on the collected data in a function outside the spider's loop. Thanks.
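For what it's worth, one thing I've been looking at is the spider's closed hook, which (if I read the docs correctly) runs once after the spider has finished - a rough sketch of what I mean, with my_function standing in for the post-processing I want to run on the collected data:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def parse_item(self, response):
        # collect the data here as usual
        pass

    def closed(self, reason):
        # called exactly once, after the crawl has finished
        self.my_function()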
I've been stuck on this for days and it's driving me crazy.
I call my scrapy spider like this:
scrapy crawl example -a follow_links="True"
我传入"follow_links"标志来确定是否应该删除整个网站,或者只是我在蜘蛛中定义的索引页面.
在spider的构造函数中检查此标志以查看应设置的规则:
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)

    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
如果它是"真",则允许所有链接; 如果它是"假",则所有链接都被拒绝.
到目前为止,这么好,但这些规则被忽略了.我可以获得遵循规则的唯一方法是在构造函数之外定义它们.这意味着,像这样的东西会正常工作:
class ExampleSpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules inside the __init__ constructor causes them to be ignored, while defining the rules outside the constructor works as expected.
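One guess I've had (unconfirmed): CrawlSpider seems to compile its rules when its own constructor runs, so maybe the assignment has to happen before the super() call, something along these lines:

def __init__(self, *args, **kwargs):
    follow_links = kwargs.get('follow_links')
    if follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
    # call the parent constructor last, so it picks up the rules set above
    super(ExampleSpider, self).__init__(*args, **kwargs)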
I don't understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content …

I'm new to Scrapy and Python. I'm trying to use FormRequest from the Scrapy examples, but it seems the formdata parameter isn't parsing the "[]" inside "Air". Any ideas for a workaround? Here is the code:
import scrapy
import re
import json
from scrapy.http import FormRequest

class AirfareSpider(scrapy.Spider):
    name = 'airfare'
    start_urls = [
        'http://www.viajanet.com.br/busca/voos-resultados#/POA/MEX/RT/01-03-2017/15-03-2017/-/-/-/1/0/0/-/-/-/-'
    ]

    def parse(self, response):
        return [FormRequest(url='http://www.viajanet.com.br/busca/resources/api/AvailabilityStatusAsync',
            formdata={"Partner":{
                    "Token":"p0C6ezcSU8rS54+24+zypDumW+ZrLkekJQw76JKJVzWUSUeGHzltXDhUfEntPPLFLR3vJpP7u5CZZYauiwhshw==",
                    "Key":"OsHQtrHdMZPme4ynIP4lcsMEhv0=",
                    "Id":"52",
                    "ConsolidatorSystemAccountId":"80",
                    "TravelAgencySystemAccountId":"80",
                    "Name":"B2C"
                },
                "Air":[{
                    "Arrival":{
                        "Iata":"MEX",
                        "Date":"2017-03-15T15:00:00.000Z"
                    },
                    "Departure":{
                        "Iata":"POA",
                        "Date":"2017-03-01T15:00:00.000Z"
                    },
                    "InBoundTime":"0",
                    "OutBoundTime":"0",
                    "CiaCodeList":"[]",
                    "BookingClass":"-1",
                    "IsRoundTrip":"true",
                    "Stops":"-1",
                    "FareType":"-"
                }],
                "Pax":{
                    "adt":"1",
                    "chd":"0",
                    "inf":"0"
                },
                "DisplayTotalAmount":"false",
                "GetDeepLink":"false",
                "GetPriceMatrixOnly":"false",
                "PageLength":"10",
                "PageNumber":"2"
            }
            , callback=self.parse_airfare)]

    def parse_airfare(self, response):
        data = json.loads(response.body)
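In case it clarifies what I'm after: my current guess is that formdata only accepts flat string-to-string pairs, so maybe I need to send the payload as a JSON body instead, roughly like this (untested sketch, where payload stands for the same nested dict I pass to formdata above):

payload = {...}  # the same nested dict as above

return [scrapy.Request(
    url='http://www.viajanet.com.br/busca/resources/api/AvailabilityStatusAsync',
    method='POST',
    body=json.dumps(payload),
    headers={'Content-Type': 'application/json'},
    callback=self.parse_airfare,
)]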
It's actually the scrapy tutorial sample for extracting data. When I type the commands into the Windows cmd everything goes fine, until the scrapy shell example:
scrapy shell 'http://quotes.toscrape.com/page/1/'
I get an exception:
twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
The details are as follows: [screenshot]
I searched Stack Overflow and found a question about a similar problem; one answer was to try another terminal. I tried PyCharm's terminal, but it failed with the same exception.
PS: I'm working on Windows with Python 2.7.12 and Anaconda 4.0.0 (64-bit).
I'm quite new to scrapy, so any help is appreciated. Thanks.
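One thing I'm wondering about (just a hunch): the error message treats 'http: as the address, so maybe Windows cmd doesn't strip the single quotes and the URL needs double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"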
My problem is this: I want to extract all the valuable text from a given domain, for example www.example.com. So I go to that website, visit all the links up to a maximum depth of 2, and write them to a csv file.
I wrote a module in scrapy that solves this with 1 process spawning multiple crawlers, but it's very inefficient - I can crawl ~1k domains / ~5k websites per hour, and as far as I can tell the bottleneck is the CPU (because of the GIL?). After leaving my PC alone for a while I also found that the network connection had dropped.
When I tried to use multiple processes I just got an error from Twisted: Multiprocessing of Scrapy Spiders in Parallel Processes. So that means I would have to learn Twisted, which compared to asyncio I'd call deprecated, but that's only my opinion.
So I have a couple of ideas about what to do.
What solution would you recommend?
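For what it's worth, the closest I've come so far is running several spiders inside one process with CrawlerProcess, roughly as in the sketch below (DomainSpider is a placeholder for my actual spider class), but that still keeps everything in a single Python process:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
for domain in ['example.com', 'example.org']:
    # DomainSpider is a placeholder for my actual spider class
    process.crawl(DomainSpider, domain=domain)
process.start()  # blocks until all crawls have finished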
Edit1:共享代码
class ESIndexingPipeline(object):
    def __init__(self):
        # self.text = set()
        self.extracted_type = []
        self.text = OrderedSet()

        import html2text
        self.h = html2text.HTML2Text()
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        body = item['body']
        body = self.h.handle(str(body, 'utf8')).split('\n')

        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')

            if len(piece) == 0:
                first_line = True
            else:
                e = ''
                if not self.text.empty() and not first_line and not …

I'm interested in keeping a reference to the order of the field names in a scrapy Item. Where is this stored?
>>> dir(item)
Out[7]:
['_MutableMapping__marker',
'__abstractmethods__',
'__class__',
'__contains__',
'__delattr__',
'__delitem__',
'__dict__',
'__doc__',
'__eq__',
'__format__',
'__getattr__',
'__getattribute__',
'__getitem__',
'__hash__',
'__init__',
'__iter__',
'__len__',
'__metaclass__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__setitem__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_cache',
'_abc_negative_cache',
'_abc_negative_cache_version',
'_abc_registry',
'_class',
'_values',
'clear',
'copy',
'fields',
'get',
'items',
'iteritems',
'iterkeys',
'itervalues',
'keys',
'pop',
'popitem',
'setdefault',
'update',
'values']
I tried item.keys(), but it returned an unordered dict.
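The only workaround I've come up with so far (a guess, not a real answer) is to keep an explicit list of the field names myself and rebuild the values in that order, for example:

from collections import OrderedDict

FIELD_ORDER = ['name', 'price', 'url']  # hypothetical field names, in the order I want

def ordered_item(item):
    # item.keys() is unordered, so rebuild the mapping in a fixed order
    return OrderedDict((k, item.get(k)) for k in FIELD_ORDER if k in item.fields)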