I'm new to Python and I'm using the Scrapy library for a web-scraping project. I'm not using the built-in domain restriction, because I want to check whether any links to pages outside the domain are dead. However, I still want to treat pages inside the domain differently from the rest, so I try to determine manually whether a site is inside the domain before parsing the response.
Response URL:
http://www.siteSection1.domainName.com
If statement:
if 'domainName.com' and ('siteSection1' or 'siteSection2' or 'siteSection3') in response.url:
    parsePageInDomain()
The statement above is truthy (and the page is parsed) when 'siteSection1' comes first in the or list, but with the same response URL the page is not parsed when the if statement looks like this:
if 'domainName.com' and ('siteSection2' or 'siteSection1' or 'siteSection3') in response.url:
    parsePageInDomain()
What am I doing wrong here? I can't reason very clearly about the logical operators in this situation, and any guidance would be greatly appreciated. Thanks!
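For reference, an expression like 'siteSection1' or 'siteSection2' evaluates to its first truthy operand, so only that one string is ever tested with in, and the bare 'domainName.com' on the left of and is always truthy; that is why reordering the sections changes the result. Each substring has to be tested against the URL separately. A minimal sketch of one way to spell that out, assuming the intent is that the URL must contain the domain name plus any one of the three sections:

sections = ('siteSection1', 'siteSection2', 'siteSection3')

def in_domain(url):
    # Test each substring against the URL individually; any() is True
    # if at least one of the section names appears in the URL.
    return 'domainName.com' in url and any(section in url for section in sections)

if in_domain(response.url):
    parsePageInDomain()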
I'm trying to install Scrapy on a MacBook Pro running Yosemite. I tried to follow the documentation on the Scrapy website for the installation by running the following command in Terminal:
pip install Scrapy
During the installation, the following exception is thrown:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/commands/install.py", line 283, in run
    requirement_set.install(install_options, global_options, root=options.root_path)
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/req.py", line 1435, in install
    requirement.install(install_options, global_options, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/req.py", line 671, in install
    self.move_wheel_files(self.source_dir, root=root)
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/req.py", line 901, in move_wheel_files
    pycompile=self.pycompile,
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/wheel.py", line 215, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/Library/Python/2.7/site-packages/pip-1.5.6-py2.7.egg/pip/wheel.py", line 205, in clobber
    os.makedirs(destdir)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, …

So far, so good for me. My question is: how can I go on to crawl this list of URLs? From searching around, I know I can return a Request from parse, but it seems that can only handle a single URL.
Here is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue b.com and c.com?
Can I do it like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
    # This is wrong, though I need something like this
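For reference, a minimal sketch of how a parse callback can hand several URLs back to Scrapy by yielding one Request per link; the follow-up callback name parse_detail is illustrative, and the variable is renamed so it no longer shadows the built-in list:

def parse(self, response):
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    for link in urls:
        # Every yielded Request is queued by the scheduler and crawled in turn.
        yield scrapy.Request(link, callback=self.parse_detail)

def parse_detail(self, response):
    # Handle each of the listed pages here.
    pass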
The full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]
    def …

I've recently been learning Python and Scrapy. I've been Googling and searching for days, but I can't seem to find any instructions on how to crawl multiple pages whose URLs are hidden behind <a href="javascript:;"> links. Basically, each page contains 20 listings, and every time the ">>" button is clicked it loads the next 20 items. I can't figure out how to find the actual URLs; the source code is below for your reference. Any pointers and help are greatly appreciated.
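When a ">>" button is wired to href="javascript:;", the next batch of items is usually fetched by an AJAX call that shows up in the browser's network tab. A minimal sketch of replaying such a call from Scrapy follows; the endpoint URL, parameter name and page value are hypothetical placeholders, not taken from any actual site:

import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing"
    start_urls = ["http://example.com/listings"]  # hypothetical listing page

    def parse(self, response):
        # ... extract the first 20 items here ...

        # Replay the request the ">>" button makes; the URL and form data
        # below stand in for whatever the network tab actually shows.
        yield scrapy.FormRequest(
            url="http://example.com/listings/load-more",
            formdata={"page": "2"},
            callback=self.parse,
        )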
My goal is to run a Scrapy crawler on this web page: http://visit.rio/en/o-que-fazer/outdoors/ . However, some of the resources inside id="container" are only loaded via a JavaScript button ("VER MAIS"). I've read a bit about Selenium, but I got nowhere.
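One commonly suggested route is to let a real browser execute the JavaScript and then hand the rendered HTML to a Scrapy selector. A minimal sketch with Selenium, assuming a local ChromeDriver and Selenium 3-style locator methods; how the "VER MAIS" button is exposed and how many clicks are needed are assumptions:

import time

from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Chrome()
driver.get("http://visit.rio/en/o-que-fazer/outdoors/")

# Click "VER MAIS" a few times so the extra items are loaded into #container.
for _ in range(3):
    driver.find_element_by_link_text("VER MAIS").click()
    time.sleep(2)  # crude wait for the new items; a WebDriverWait would be cleaner

# Hand the rendered page source to a Scrapy selector for normal extraction.
selector = Selector(text=driver.page_source)
links = selector.css("#container a::attr(href)").extract()
driver.quit()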
I want to extract content from a news site's RSS feed that looks like this:
<item>
    <title>BPS: Kartu Bansos Bantu Turunkan Angka Gini Ratio</title>
    <media:content url="/image.jpg" expression="full" type="image/jpeg"/>
</item>
But an error is raised when I use an XPath like media.xpath('//media:content') to parse elements such as media:content:
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 183, in xpath
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
  File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 179, in xpath
    smart_strings=self._lxml_smart_strings)
  File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57923)
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:167084)
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:166043)
ValueError: XPath error: Undefined namespace prefix in //media:content
Does anyone know what I should do? Thanks :)
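The lxml error simply means the media prefix was never declared to the selector. A minimal sketch of two common ways around it; the namespace URI below is the usual Media RSS one (http://search.yahoo.com/mrss/), which is an assumption about this particular feed:

# Option 1: register the prefix, then keep using it in XPath expressions.
response.selector.register_namespace('media', 'http://search.yahoo.com/mrss/')
image_urls = response.xpath('//media:content/@url').extract()

# Option 2: strip namespaces entirely and query by the local tag name.
response.selector.remove_namespaces()
image_urls = response.xpath('//content/@url').extract()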
Given multiple URLs, I'm trying to crawl and scrape several pages. I'm using Wikipedia for testing, and to keep things simple I've used the same XPath selector for every page, but eventually I want to use many different XPath selectors unique to each page, so each page has its own separate parsePage method.
This code works fine when I don't use an item loader and just populate the item directly. When I use an item loader, the items are populated strangely, and it seems to completely ignore the callback assigned in the parse method and to use only the start_urls for the parsePage methods.
import scrapy
from scrapy.http import Request
from scrapy import Spider, Request, Selector
from testanother.items import TestItems, TheLoader
class tester(scrapy.Spider):
    name = 'vs'
    handle_httpstatus_list = [404, 200, 300]
    #Usually, I only get data from the first start url
    start_urls = ['https://en.wikipedia.org/wiki/SANZAAR','https://en.wikipedia.org/wiki/2016_Rugby_Championship','https://en.wikipedia.org/wiki/2016_Super_Rugby_season']

    def parse(self, response):
        #item = TestItems()
        l = TheLoader(item=TestItems(), response=response)
        #when I use an item loader, the url in the request is completely ignored. without the item loader, it works properly.
        request = Request("https://en.wikipedia.org/wiki/2016_Rugby_Championship", callback=self.parsePage1, meta={'loadernext':l}, dont_filter=True)
        yield …

<div id="content-body-14269002-17290547">
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
I need to select everything whose id = "content-body*".
The content-body id changes on every page; perhaps I need to use a wildcard?
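XPath has no * wildcard inside attribute values, but starts-with() (or contains()) covers this case, and CSS has an equivalent prefix selector. A minimal sketch against a Scrapy response, assuming the goal is the text of the <p> elements inside that div:

# Match any div whose id begins with "content-body", whatever the numeric suffix is.
paragraphs = response.xpath('//div[starts-with(@id, "content-body")]/p/text()').extract()

# The CSS equivalent uses the ^= attribute-prefix selector.
paragraphs = response.css('div[id^="content-body"] p::text').extract()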
I want to scrape information from multiple URLs. I'm using the following code, but it doesn't work. Could someone please point out where I'm going wrong?
import scrapy

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "http://www.amazon.com/dp/"
    ASIN = ['B01LA6171I', 'B00OUKHTLO','B00B7LUVZK']

    def start_request(self):
        for i in ASIN:
            yield scrapy.Request(url=domain+i,callback = self.parse)

    def parse(self, response):
        title =response.css("span#productTitle::text").extract_first().strip()
        ASIN_ext = response.xpath("//input[@name='ASIN']/@value").extract_first()
        data = {"ASIN":ASIN_ext,"title":title,}
        yield data
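For what it's worth, two things stand out in that spider: Scrapy only calls a method named start_requests (plural), so start_request is never invoked, and the class attributes have to be reached through self. A sketch of one way it could be fixed, assuming everything else stays the same:

import scrapy

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "http://www.amazon.com/dp/"
    ASIN = ['B01LA6171I', 'B00OUKHTLO', 'B00B7LUVZK']

    def start_requests(self):  # note the plural: this is the hook Scrapy looks for
        for i in self.ASIN:
            yield scrapy.Request(url=self.domain + i, callback=self.parse)

    def parse(self, response):
        title = response.css("span#productTitle::text").extract_first()
        asin = response.xpath("//input[@name='ASIN']/@value").extract_first()
        yield {"ASIN": asin, "title": title.strip() if title else None}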
I'd like to have some functionality common to my spiders in a custom base spider class.
Normally, Scrapy spiders inherit from the scrapy.Spider class.
I tried creating a BaseSpider class in my project's spiders folder, but it doesn't work:
import scrapy

class BaseSpider(scrapy.Spider):
    def __init__(self):
        super(scrapy.Spider).__init__()

    def parse(self, response):
        pass
This is my real spider:
import scrapy
import BaseSpider

class EbaySpider(BaseSpider):
    name = "ebay"
    allowed_domains = ["ebay.com"]

    def __init__(self):
        self.redis = Redis(host='redis', port=6379)
        # rest of the spider code
which gives this error:
TypeError: Error when calling the metaclass bases
module.__init__() takes at most 2 arguments (3 given)
Then I tried multiple inheritance, making my eBay spider look like this:
class EbaySpider(scrapy.Spider, BaseSpider):
    name = "ebay"
    allowed_domains = ["ebay.com"]

    def __init__(self):
        self.redis = Redis(host='redis', port=6379)
        # rest of the spider code …
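The metaclass error comes from import BaseSpider: that imports a module, not a class, so EbaySpider ends up inheriting from a module object. A minimal sketch of one way to set this up, assuming the base class lives in its own base_spider.py inside the project; the file names and the helper method are illustrative, and the exact import path depends on the project layout:

# base_spider.py
import scrapy

class BaseSpider(scrapy.Spider):
    """Functionality shared by all spiders in the project."""

    def common_helper(self):
        # illustrative shared behaviour
        pass

# ebay_spider.py
from redis import Redis

from base_spider import BaseSpider  # import the class, not the module

class EbaySpider(BaseSpider):
    name = "ebay"
    allowed_domains = ["ebay.com"]

    def __init__(self, *args, **kwargs):
        # keep scrapy.Spider's own initialisation, then add the extras
        super(EbaySpider, self).__init__(*args, **kwargs)
        self.redis = Redis(host='redis', port=6379)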