I am trying to extract the business name and address from each listing and export them to CSV, but I am having trouble with the CSV output. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problem.
yp_spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from yp.items import Biz

class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []
        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class Biz(Item):
    name = Field()
    address = Field()

    def __str__(self):
        return "Website: name=%s address=%s" % (self.get('name'), self.get('address'))
The output of 'scrapy crawl ypages -o list.csv -t csv' is one long list of the business names followed by the locations, and it repeats the same data several times.
You should add a "." to make the XPath relative to each listing. Here is the relevant passage from the Scrapy documentation (http://doc.scrapy.org/en/0.16/topics/selectors.html):
First, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the whole document, not only those inside the <div> elements:
>>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()
This is the proper way to do it (note the dot prefixing the .//p XPath):
>>> for p in divs.select('.//p'):  # extracts all <p> inside
...     print p.extract()
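Applied to your spider, this also explains the symptom: because the absolute //h3 and //span queries match the whole page, every iteration of the loop pulls in the names and addresses of all listings, so the same data gets repeated once per listing. Here is a sketch of a corrected parse method (it additionally returns the items, which the original code never did; Scrapy's CSV exporter only sees items that the callback returns or yields):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []
        for biz in bizs:
            item = Biz()
            # the leading dot makes the query relative to this listing's <div>
            item['name'] = biz.select(".//h3/a/text()").extract()
            item['address'] = biz.select(".//span[@class='street-address']/text()").extract()
            items.append(item)
        return items  # 'scrapy crawl ypages -o list.csv -t csv' exports these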