Posts by Win*_*ant

How does Scrapy handle JavaScript?

Spider for reference:

import scrapy

from script.items import ScriptItem


class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        # Yield one item per "shopthepost" widget found in the downloaded HTML.
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            # print(widget.extract())  # debug: dump the widget's raw HTML
            item = ScriptItem()
            # Collect the href of every link nested inside the widget.
            item['url'] = widget.xpath('.//a/@href').extract()
            yield item

When I run it, the terminal output is as follows:

2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = …
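The empty `url` list is consistent with the widget's links being inserted by the inline script after the page loads, which Scrapy does not execute on its own. The following is a minimal sketch of that situation using only the Python standard library, not Scrapy itself. `STATIC_HTML` and `WidgetProbe` are hypothetical names introduced for illustration, and the script body is a placeholder because it is truncated in the output above.

```python
from html.parser import HTMLParser

# Stand-in for the static HTML that the spider downloads. The script body
# is a placeholder: the real one is truncated in the output above.
STATIC_HTML = """
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">/* loader script, elided */</script>
</div>
"""


class WidgetProbe(HTMLParser):
    """Collects widget ids and any links found inside widget divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # <div> nesting depth inside a widget; 0 means outside
        self.widget_ids = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif "shopthepost-widget" in (attrs.get("class") or "").split():
                self.depth = 1
                self.widget_ids.append(attrs.get("data-widget-id"))
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


probe = WidgetProbe()
probe.feed(STATIC_HTML)
print(probe.links)       # no <a> tags exist until the script has run
print(probe.widget_ids)  # the widget id is present in the static HTML
```

In the spider itself, the same attribute should be reachable with `widget.xpath('@data-widget-id').extract()`. Getting the actual links would probably require rendering the page in a browser (the question is tagged selenium) or requesting whatever the loader script fetches, which the truncated output does not reveal.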

javascript selenium scrapy web-scraping scrapy-spider

4
votes
2
answers
1751
views

No module named html2text

I have tried various ways to install the html2text library, and in every case IPython fails to import it with the error message

"ImportError: No module named html2text"

The directory '/Users/NDunn/Library/Caches/pip/http' or its parent  directory is not
 owned by the current user and the cache has been disabled.
 Please check the permissions and owner of that directory.
 If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/NDunn/Library/Caches/pip/http' or its parent directory is not owned
by the current user and the cache has been disabled.
Please check the permissions and owner of that directory.
If executing pip with sudo, you may want sudo's …
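One possible cause, which the output above does not confirm, is that pip installed the package for a different Python interpreter than the one IPython runs. The sketch below, assuming Python 3.4 or later, can be run inside IPython to check this. `module_location` is a hypothetical helper name introduced here.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# The interpreter this session is running under.
print(sys.executable)

# Where html2text would be loaded from, or None if this interpreter cannot see it.
print(module_location("html2text"))
```

If the second line prints `None`, running pip through the interpreter path printed on the first line (that path followed by `-m pip install html2text`) installs the package for that same interpreter.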

python ipython

2
votes
1
answer
6307
views