Posts by Rol*_*Max

How to install Twisted + Scrapy with Python 3.6 on CentOS

I'm running the latest Python on CentOS 7, inside a dedicated virtualenv:

(ENV) [luoc@study ~ ]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.3.1611 (Core) 
Release:    7.3.1611
Codename:   Core

(ENV) [luoc@study ~ ]$ python --version
Python 3.6.0

When I try to install Scrapy, it fails:

(ENV) [luoc@study ~ ]$ pip install scrapy
Collecting scrapy
  Using cached Scrapy-1.3.2-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy)
  Using cached cssselect-1.0.1-py2.py3-none-any.whl
Requirement already satisfied: six>=1.5.2 in ./ENV/lib/python3.6/site-packages (from scrapy)
Collecting Twisted>=13.1.0 (from scrapy)
  Could not find a version that satisfies the requirement Twisted>=13.1.0 (from scrapy) (from versions: )
No …
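An empty `(from versions: )` list usually means pip could not see any usable Twisted release at all, rather than a version conflict. A common remedy at the time was to upgrade the packaging tools inside the virtualenv and make sure a C toolchain and Python headers were available, since Twisted may need to build C extensions from source. This is a hedged sketch, not a guaranteed fix; the exact `yum` package names (especially the Python 3.6 headers, e.g. `python36u-devel` from the IUS repo) depend on how Python 3.6 was installed and are assumptions here:

```shell
# Inside the activated virtualenv (ENV):

# 1. Upgrade the packaging tools first -- old pip/setuptools versions
#    can fail to resolve newer releases on PyPI.
pip install --upgrade pip setuptools wheel

# 2. Twisted may compile C extensions; install a compiler plus the
#    matching Python headers (package names are an assumption and
#    vary with how Python 3.6 was installed).
sudo yum install -y gcc libffi-devel openssl-devel python36u-devel

# 3. Retry. Installing Twisted explicitly first surfaces any build
#    error before the rest of Scrapy's dependency tree is pulled in.
pip install Twisted
pip install scrapy
```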

python twisted scrapy

6
Votes
2
Answers
4326
Views

Making Scrapy ignore noindex pages

I'm crawling a large number of URLs and wondering whether it's possible to have Scrapy skip parsing pages that carry 'meta name="robots" content="noindex"'. Looking at the deny rules listed at http://doc.scrapy.org/en/latest/topics/link-extractors.html, it appears that deny rules apply only to URLs. Can Scrapy ignore pages based on an XPath?

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class Spider(CrawlSpider):
    name = "browsetest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com",]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/browse/')), callback="parse_items", follow= True),
        Rule(SgmlLinkExtractor(allow=(),unique=True,deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 'special_offers', 'search_sort=&', 'facet=' ))),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = …
Run Code Online (Sandbox Code Playgroud)
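Link-extractor `deny` rules do indeed match only URLs, so a content-based check such as `noindex` has to happen after the response is downloaded, inside the callback. Below is a minimal sketch of that idea using only the standard library; the `has_noindex` helper and its regexes are assumptions for illustration, not Scrapy API. In a real spider you would more likely test `response.xpath('//meta[@name="robots"]/@content')` against the downloaded page:

```python
import re


def has_noindex(html: str) -> bool:
    """Return True if the page carries <meta name="robots" content="...noindex...">."""
    for tag in re.findall(r"<meta[^>]*>", html, flags=re.IGNORECASE):
        if re.search(r"name\s*=\s*['\"]robots['\"]", tag, re.IGNORECASE) and \
           re.search(r"content\s*=\s*['\"][^'\"]*noindex", tag, re.IGNORECASE):
            return True
    return False


# In the spider's callback, bail out early for noindex pages:
#
#     def parse_items(self, response):
#         if has_noindex(response.body_as_unicode()):
#             return  # skip item extraction for this page
#         ...
```

Note that this only skips item extraction: links from a noindex page are still followed by the crawl rules. Filtering those as well would need a downloader middleware that drops such responses entirely.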

python web-crawler scrapy

1
Votes
1
Answers
1016
Views

Tag statistics

python ×2

scrapy ×2

twisted ×1

web-crawler ×1