Tag: scrapy

Get all spider class names in Scrapy

In older versions we could get the list of spiders (spider names) with the code below, but in the current version (1.4) it raises a ScrapyDeprecationWarning saying that the CrawlerRunner.spiders attribute has been renamed to CrawlerRunner.spider_loader.

The warning and the code that triggers it:

[py.warnings] WARNING: run-all-spiders.py:17: ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader.
for spider_name in process.spiders.list():
    # list all the available spiders in my project

How can I get the list of spiders (and the corresponding class names) in Scrapy?
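The warning itself names the replacement attribute, `spider_loader`. A minimal sketch of how the names and class names could be collected through it, assuming the loader exposes `list()` and `load()` as Scrapy's `SpiderLoader` does; the helper names `spider_classes` and `project_spider_classes` are hypothetical:

```python
def spider_classes(spider_loader):
    """Map each spider name to the name of the class that implements it."""
    return {
        name: spider_loader.load(name).__name__
        for name in spider_loader.list()
    }


def project_spider_classes():
    """Build the mapping for the current Scrapy project (run from the project directory)."""
    # Imported here so that spider_classes() can be reused without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    return spider_classes(process.spider_loader)
```

Calling `project_spider_classes()` inside a project should return a dictionary keyed by spider name, with the class name as the value.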

python scrapy web-scraping

Yus*_*sef

2019 04-24

1
Votes

1
Answers

1046
Views

Error installing Scrapy for Python 3

I am installing Scrapy for Python 3 on Ubuntu using pip:

sudo pip3 install scrapy

During installation I get this error:

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -c src/twisted/test/raiser.c -o build/temp.linux-x86_64-3.5/src/twisted/test/raiser.o
    src/twisted/test/raiser.c:4:20: fatal error: Python.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-_il8a07a/Twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-9935fpm4-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-_il8a07a/Twisted/
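The fatal error in the log says that the compiler could not find `Python.h`, the C header needed to build extensions such as the one in Twisted. On Debian and Ubuntu this header is typically provided by the `python3-dev` package. The sketch below only checks whether the header is present for the running interpreter; the function names are hypothetical:

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects to find Python.h."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers():
    """True if Python.h is present, i.e. C extensions can be compiled."""
    return os.path.isfile(python_header_path())
```

If `has_python_headers()` returns `False`, installing the development headers before rerunning pip is the usual next step.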

python ubuntu failed-installation scrapy

Ani*_*ddh

1
Votes

1
Answers

1239
Views

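Python Scrapy - service_identity (opentype) not working and cannot be installed

I am a Python and Linux beginner trying to get Scrapy up and running, following the instructions and code at https://doc.scrapy.org/en/latest/intro/tutorial.html. I get the user warning "You do not have a working installation of the service_identity module: 'cannot import name 'opentype''".

I downloaded and tried to install "service_identity", but got "Requirement already satisfied" at various stages of the installation. I tried both pip3 and installing the .whl file downloaded from the PyPI URL below.

Running Python 3.5.3 on Lubuntu 17.04 in VirtualBox.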

mat@mat-VirtualBox:~$ scrapy startproject tutorial2
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
New Scrapy project 'tutorial2', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
    /home/mat/tutorial2

You can start your first …
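"Requirement already satisfied" only means that pip considers the package installed; it does not rule out a version mismatch among its dependencies. As a diagnostic sketch, the helper below lists the installed versions of `service_identity` and some related packages. It assumes Python 3.8 or newer for `importlib.metadata` (newer than the 3.5.3 in the question), and the choice of packages to inspect is an assumption, based on `opentype` appearing to be a `pyasn1` module:

```python
from importlib import metadata  # available from Python 3.8


def installed_versions(packages=("service_identity", "pyasn1", "pyasn1-modules", "Twisted")):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            # The distribution is not installed for this interpreter.
            versions[package] = None
    return versions
```

Comparing the reported versions against each package's stated requirements may show which one is out of date.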

scrapy python-3.x

Author

2019 04-25

1
Votes

1
Answers

7865
Views

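Python Scrapy extract_first() documentation

From this question I learned that the extract_first() method of Scrapy Selector instances accepts an optional default argument, which is very useful. However, I could not find any official documentation describing this feature. Even the Selector reference under Selector objects does not mention it. Maybe extract_first() has other hidden features? Does anyone know where a complete description of extract_first() can be found?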
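A short sketch of the behaviour described above: `extract_first()` returns the first match, `None` when nothing matches, or the `default` value when one is given. The HTML snippet and the function name `extract_first_demo` are made up for illustration, and Scrapy must be installed to run it:

```python
def extract_first_demo():
    """Show extract_first() with a match, without a match, and with a default."""
    # Imported lazily; requires Scrapy to be installed.
    from scrapy.selector import Selector

    selector = Selector(text='<p class="title">Hello</p>')
    found = selector.css("p.title::text").extract_first()
    missing = selector.css("p.subtitle::text").extract_first()
    fallback = selector.css("p.subtitle::text").extract_first(default="n/a")
    return found, missing, fallback
```

In recent Scrapy versions the same behaviour is available as `get()`, which is described in the documentation of parsel, the library underlying Scrapy selectors.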

python scrapy

Ost*_*nko

1
Votes

1
Answers

1405
Views

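Scrapy DOWNLOAD_DELAY not working for sequential requests

I am currently using the Scrapy Python library.

First, I make a FormRequest call to Fitbit's login page (https://www.fitbit.com/login) to log in. Then I make close to 100 requests to Fitbit's API (https://api.fitbit.com).

To avoid stressing the API (and to avoid getting banned!), I wanted to set a delay between requests using DOWNLOAD_DELAY in the settings.py file. However, it is not working.

I tested it with the tutorial (http://scrapy.readthedocs.io/en/latest/intro/tutorial.html) and it worked properly there.

What do you think? Is it because I am requesting an API (which is supposed to handle that kind of access)?

EDIT: here is the pseudocode of my spider: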

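import scrapy


class FitbitSpider(scrapy.Spider):
    name = "fitbit"  # placeholder; Scrapy requires every spider to have a name
    start_urls = ["https://www.fitbit.com/login"]

    def parse(self, response):
        # url and formdata stand for the login form target and fields (not given in the question)
        yield scrapy.FormRequest(url, formdata=formdata, callback=self.after_login)

    def after_login(self, response):
        # Issue the API requests once the login response has arrived
        for i in range(100):
            yield scrapy.Request("https://api.fitbit.com/[...]")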

EDIT 2: here is my settings.py file:

BOT_NAME = 'fitbitscraper'

SPIDER_MODULES = ['fitbitscraper.spiders']
NEWSPIDER_MODULE = 'fitbitscraper.spiders'

DOWNLOAD_DELAY = 20 #20 seconds of delay should be pretty noticeable
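For context, Scrapy applies DOWNLOAD_DELAY per download slot (by default one slot per domain, so www.fitbit.com and api.fitbit.com are throttled separately), and RANDOMIZE_DOWNLOAD_DELAY, which is on by default, varies the wait between 0.5 and 1.5 times the configured value. The sketch below shows the related settings and a hypothetical helper, `delay_range`, that computes the expected bounds. It illustrates how the settings interact and is not a confirmed fix for the problem above:

```python
# settings.py (sketch; values are illustrative)
DOWNLOAD_DELAY = 20               # seconds between requests to the same download slot
RANDOMIZE_DOWNLOAD_DELAY = False  # default is True: wait 0.5x to 1.5x DOWNLOAD_DELAY


def delay_range(delay, randomize=True):
    """Return the (min, max) wait Scrapy aims for between two requests to one slot."""
    if randomize:
        return (0.5 * delay, 1.5 * delay)
    return (delay, delay)
```

With the default randomization, a delay of 20 seconds means waits anywhere between 10 and 30 seconds.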

python api scrapy fitbit

Xem*_*ema

2018 01-09

1
Votes

2
Answers

694
Views

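What does request.headers.setdefault() mean in Scrapy

I want to set up a custom UserAgentMiddleware in Scrapy. But when I saw request.headers.setdefault('User-Agent', ua) I did not understand what it does, and I could not find the method in either Scrapy or requests.

Where can I find an explanation of it?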
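`request.headers` in Scrapy is a dict-like object, and `setdefault()` is the standard Python dictionary method: it stores the value only when the key is not already present. A minimal middleware sketch built on that behaviour follows; the class name and the default user agent string are hypothetical:

```python
class CustomUserAgentMiddleware:
    """Downloader middleware sketch: set a User-Agent unless the request already has one."""

    def __init__(self, user_agent="my-crawler/0.1"):  # hypothetical value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault() only writes the header when it is missing, so a
        # User-Agent set explicitly on a request is left untouched.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # let Scrapy continue processing the request
```

The method is described in the Python documentation for `dict.setdefault`, which is why it does not show up as a Scrapy-specific API.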

python scrapy python-requests

Yix*_*uan

2018 01-02

1
Votes

1
Answers

541
Views

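Scrapy, Scrapinghub and Google Cloud Storage: KeyError 'gs' when running a spider on Scrapinghub

I am working on a Scrapy project using Python 3, and the spiders are deployed to Scrapinghub. I am also using Google Cloud Storage to store the scraped files, as mentioned in the official docs here.

The spiders run absolutely fine when I run them locally, and they are deployed to Scrapinghub without any errors. I am using scrapy:1.4-py3 as the stack for Scrapinghub. While running the spiders there, I get the following error: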

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File …
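The traceback is cut off, so the failing line is not visible. One possibility, stated here as an assumption, is that the Scrapy version on the stack does not register the `gs` scheme in its files pipeline; as far as I know, Google Cloud Storage support was added in Scrapy 1.5. The sketch below lists the storage schemes known to whichever Scrapy is installed; the function name is hypothetical:

```python
def supported_store_schemes():
    """Return the URI schemes that the installed FilesPipeline can store to."""
    # Imported lazily; requires Scrapy to be installed.
    from scrapy.pipelines.files import FilesPipeline

    return sorted(FilesPipeline.STORE_SCHEMES)
```

Running this both locally and on the deployed stack would show whether `gs` is available in each environment.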

scrapy python-3.x google-cloud-storage scrapinghub google-cloud-platform

Sag*_*rma

1
Votes

1
Answers

725
Views

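No such file or directory error when using PyInstaller with Scrapy

I have a Python script that uses Scrapy, and I want to turn it into an exe file with PyInstaller. The exe file is generated without any errors, but an error occurs when I open it.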

FileNotFoundError: [Errno 2] No such file or directory: '...\\scrapy\\VERSION'

I tried reinstalling Scrapy, but that did not help. I am using Windows 10 with Python 3.
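The error points at `scrapy\VERSION`, which is a data file and not a Python module, so PyInstaller may not bundle it unless told to. A sketch of how the `--add-data` arguments could be built, assuming the missing files are `VERSION` and `mime.types` inside the scrapy package; the helper name `add_data_args` is hypothetical:

```python
import os


def add_data_args(package_dir, filenames=("VERSION", "mime.types")):
    """Build PyInstaller --add-data arguments for data files inside the scrapy package."""
    args = []
    for filename in filenames:
        source = os.path.join(package_dir, filename)
        # PyInstaller expects SOURCE<sep>DEST, where <sep> is os.pathsep
        # (";" on Windows, ":" elsewhere); DEST is the folder inside the bundle.
        args.append("--add-data=" + source + os.pathsep + "scrapy")
    return args
```

For example, `add_data_args(os.path.dirname(scrapy.__file__))` yields arguments that can be appended to the `pyinstaller` command line.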

python pyinstaller scrapy

Dai*_*tas

1
Votes

1
Answers

1456
Views

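Scrapy Splash: clicking a button does not work

What I am trying to do

On avito.ru (a Russian real estate website), a person's phone number is hidden until you click on it. I want to collect the phone numbers using Scrapy + Splash.

Example URL: https://www.avito.ru/moskva/kvartiry/2-k_kvartira_84_m_412_et._992361048

After you click the button, a pop-up is displayed and the phone number is visible.

I use the Splash execute API with the following Lua script: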

function main(splash)
    splash:go(splash.args.url)
    splash:wait(10)
    splash:runjs("document.getElementsByClassName('item-phone-button')[0].click()")
    splash:wait(10)
    return splash:png()
end

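The problem

The button is not clicked and the phone number is not displayed. This is a trivial task, and I cannot explain why it does not work.

Clicking works fine for other fields on the same page if item-phone-button is replaced with js-show-stat. So JavaScript works in general, and there must be something special about the blue "Show phone" button.

What I have tried

To isolate the problem, I created a repository containing a minimal example script and a docker-compose file for Splash: https://github.com/alexanderlukanin13/splash-avito-phone

The JavaScript code works; you can verify it with the JavaScript console in Chrome and Firefox: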

document.getElementsByClassName('item-phone-button')[0].click()

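I have tried Splash versions 3.0, 3.1 and 3.2; the result is the same.

Update

I have also tried:

@Lore's suggestion, including the simulateClick() method (see the simulate_click branch)
mouseDown / mouseUp events, as described here: Simulating a mousedown, click, mouseup sequence in Tampermonkey? (see the trigger_mouse_event branch)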
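For readers who want to reproduce the setup, the Lua script from the question can be posted to Splash's `execute` endpoint as JSON, with the script in `lua_source` and any other fields exposed to Lua as `splash.args`. The sketch below only builds the request and does not send it. The Splash address `http://localhost:8050` and the function name are assumptions, and the sketch does not address the click problem itself:

```python
import json
import urllib.request

LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(10)
    splash:runjs("document.getElementsByClassName('item-phone-button')[0].click()")
    splash:wait(10)
    return splash:png()
end
"""


def build_execute_request(page_url, splash_url="http://localhost:8050"):
    """Prepare (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": LUA_SCRIPT, "url": page_url}
    return urllib.request.Request(
        splash_url + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen()` against a running Splash instance should return the PNG screenshot produced by `splash:png()`.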

python scrapy splash-js-render

ale*_*n13

2018 03-19

1
Votes

1
Answers

3050
Views

How can I scrape an element inside a class using response.css

I am trying to get value="3474636382675" from:

<input class="lst" value="3474636382675" title="Zoeken" autocomplete="off" id="sbhost" maxlength="2048" name="q" type="text">

I tried:

response.css(".lst >value").extract()

The following works, but it returns the whole element, and I only need the value:

response.css(".lst").extract()
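Scrapy's CSS selectors support the `::attr(name)` pseudo-element for reading an attribute, which fits this case. A minimal sketch using the input element from the question follows; the function name `input_value_demo` is hypothetical, and Scrapy must be installed to run it:

```python
def input_value_demo():
    """Extract the value attribute of the input with class "lst"."""
    # Imported lazily; requires Scrapy to be installed.
    from scrapy.selector import Selector

    html = (
        '<input class="lst" value="3474636382675" title="Zoeken" '
        'autocomplete="off" id="sbhost" maxlength="2048" name="q" type="text">'
    )
    selector = Selector(text=html)
    # ::attr(value) selects the attribute instead of the whole element
    return selector.css(".lst::attr(value)").extract_first()
```

Inside a spider callback the equivalent call would be `response.css(".lst::attr(value)").extract_first()`.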

scrapy web-scraping python-3.x

Rou*_*ack

1
Votes

1
Answers

628
Views

Tag statistics

scrapy ×10

python ×7

python-3.x ×3

web-scraping ×2

api ×1

failed-installation ×1

fitbit ×1

google-cloud-platform ×1

google-cloud-storage ×1

pyinstaller ×1

python-requests ×1

scrapinghub ×1

splash-js-render ×1

ubuntu ×1
