I'm working through the Scrapy tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html
When I run the example spider below from the tutorial, I noticed that even though it already iterates over a list of selectors, the title I get from sel.xpath('a/text()').extract() is still a list containing a single string, like [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example that list is assigned directly to an item field, item['title'] = sel.xpath('a/text()').extract(), which seems logically wrong to me.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
However, if I use the following code instead:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)
then link is a string rather than a list.
Is this a bug, or is it intended?
.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.
See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract
(SelectorList).extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
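Here is a minimal standalone sketch of that difference, using parsel directly with a made-up HTML snippet (the snippet and variable names are just for illustration, not from the tutorial):

from parsel import Selector

# hypothetical HTML, shaped like the dmoz listing in the question
html = u"<ul><li><a href='/books/'>Python 3 Object Oriented Programming</a></li></ul>"
sel = Selector(text=html)

titles = sel.xpath('//ul/li/a/text()')   # a SelectorList
print(type(titles))                      # parsel.selector.SelectorList
print(titles.extract())                  # list of unicode strings: [u'Python 3 Object Oriented Programming']
print(titles.extract_first())            # a single unicode string (or None if nothing matched)

first = titles[0]                        # indexing gives a single Selector
print(first.extract())                   # also a single unicode string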
.extract_first() is what you're looking for (and it is hardly documented).
Taken from http://doc.scrapy.org/en/latest/topics/selectors.html:
If you want to extract only the first matched element, you can call the selector's
.extract_first()
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
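Applied to your first spider, a minimal sketch of the fix is simply to swap .extract() for .extract_first() (assuming you want one title/link/description per <li>, and accepting that it returns None when nothing matches):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            # .extract_first() returns a single unicode string (or None if there is no match)
            title = sel.xpath('a/text()').extract_first()
            link = sel.xpath('a/@href').extract_first()
            desc = sel.xpath('text()').extract_first()
            print title, link, desc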
In your other example:
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)
each href in the loop is a Selector object. Calling .extract() on it gives you a single Unicode string:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]:
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
...
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]
So .css() on a response returns a SelectorList:
In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList
Looping over that object gives you Selector instances:
In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
...: print href
...:
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
And calling .extract() on each one gives you a single Unicode string:
In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
print type(href.extract())
...:
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
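As a follow-up: if what you actually want is the whole list of links, you don't need the loop at all. Calling .extract() on the SelectorList itself gives you the flattened list of strings in one go, and .extract_first() gives just the first one. A minimal sketch, reusing the same selector in the shell:

# all matched hrefs, as a list of unicode strings
links = response.css("ul.directory.dir-col > li > a::attr('href')").extract()

# just the first matched href, as a single unicode string (or None if nothing matched)
first_link = response.css("ul.directory.dir-col > li > a::attr('href')").extract_first()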
Note: .extract() on a Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (which is the same as Scrapy selectors, and is used under the hood in Scrapy 1.1+).