Tag: web-scraping

Scrapy: detect whether an XPath does not exist

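I have been trying to build my first crawler, and I have what I need working (getting the shipping info and price for the 1st and 2nd shops), but with 2 crawlers instead of 1, because I have hit a big roadblock here.

When there is more than 1 shop, the output is: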

In [1]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()').extract()
Out[1]: 
[u'ENV\xcdO 3,95\u20ac ',
 u'ENV\xcdO GRATIS',
 u'ENV\xcdO GRATIS',
 u'ENV\xcdO 4,95\u20ac ']

To get only the second result I am using:

In [2]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
Out[2]: u'ENV\xcdO GRATIS'

But when there is no second result (only 1 shop), I get:

IndexError: list index out of range

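The crawler then skips the whole page, even though the other items have data.

After several attempts I settled on a quick workaround to get results: 2 crawlers, one for the first shop and another for the second. Now I want to do it cleanly with a single crawler.

Any help, tips, or advice would be appreciated. This is my first attempt at a recursive crawler with Scrapy, and I rather like it.

Here is the code: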

# -*- coding: utf-8 -*-
import scrapy
from Guapalia.items import GuapaliaItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class GuapaliaSpider(CrawlSpider):
    name = "guapalia"
    allowed_domains = …
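
A minimal sketch of one way to avoid the `IndexError`: extract the full list first, then read the second entry only if it exists. The helper name `nth_or_default` is hypothetical, and the sample lists are copied from the output shown in the question; in a spider they would come from `response.xpath(...).extract()`.

```python
def nth_or_default(values, index, default=None):
    """Return values[index], or default when the list is too short."""
    try:
        return values[index]
    except IndexError:
        return default


# Sample data taken from the question's shell output (assumed shape:
# one shipping string per shop, in page order).
two_shops = [u'ENV\xcdO 3,95\u20ac ', u'ENV\xcdO GRATIS']
one_shop = [u'ENV\xcdO 3,95\u20ac ']

second_shipping = nth_or_default(two_shops, 1)   # second shop exists
missing_shipping = nth_or_default(one_shop, 1)   # no second shop: None
```

With a default in place of the exception, the callback can still yield the item for a single-shop page and leave the second shop's fields empty.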

xpath web-crawler scrapy web-scraping python-2.7

Eli*_*elo

2017 09-21

1
vote

1
answer

2252
views

Scraping from PythonAnywhere

I have a free account on PythonAnywhere, where I am trying to run the following script, which works just fine locally.

I am wondering whether the error I get has a technical cause, or whether PythonAnywhere simply forbids scraping certain websites from its platform.

Do you know of other free hosts where I would be allowed to scrape anything?

import requests
from bs4 import BeautifulSoup as bs

def scrapMarketwatch(address):
    #creating formatting data from scrapdata
    r …

beautifulsoup web-scraping pythonanywhere

use*_*529

lucky-day

1
vote

1
answer

1369
views

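Click an element by its aria-label using XPath

I can't click a button using XPath and Selenium. In this case the aria-label contains a unique date, which distinguishes it perfectly from the other buttons in the calendar.

Here is the HTML for the buttons that show the 4th and 5th and each day's price.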

<button type="button" class="CalendarDay__button" aria-label="Choose sunday, february 4 2018 as your check-in date. It's available." tabindex="0"><div class="calendar-day"><div class="day">4</div><div class="flybondi font-p fare-price"><span>$728*</span></div></div></button>
<button type="button" class="CalendarDay__button" aria-label="Choose monday, february 5 2018 as your check-in date. It's available." tabindex="0"><div class="calendar-day"><div class="day">5</div><div class="flybondi font-p fare-price"><span>$728*</span></div></div></button>

Suppose I want to click February 5, 2018. I tried:

dtd0_button = driver.find_element_by_xpath("//button[contains(@class, 'CalendarDay__button') and (@aria-label, 'Choose monday, February 5 2018 as your check-in date. It's available') ]")
dtd0_button.Click()

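What is wrong with this approach? I would like to be able to click any date in the page's calendar, but I get the following message: 'WebElement' object has no attribute 'Click'.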
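
A hedged sketch of what a working version might look like. Three things stand out in the attempt above: the Python bindings spell the method `click()` in lowercase; the second condition is missing `contains(`; and the apostrophe in "It's" ends the single-quoted XPath literal early. XPath string matching is also case-sensitive, and the markup uses lowercase "february". The helper names below are hypothetical, and matching on a date fragment instead of the full label is an assumption.

```python
def calendar_day_xpath(date_text):
    """Build an XPath for the calendar button whose aria-label contains date_text."""
    if "'" in date_text:
        # A single quote would terminate the XPath string literal early.
        raise ValueError("date_text must not contain a single quote")
    return ("//button[contains(@class, 'CalendarDay__button') "
            "and contains(@aria-label, '%s')]" % date_text)


def click_calendar_day(driver, date_text):
    """Click the matching day; driver is assumed to be a Selenium WebDriver."""
    from selenium.webdriver.common.by import By  # imported lazily; needs selenium
    driver.find_element(By.XPATH, calendar_day_xpath(date_text)).click()


xpath = calendar_day_xpath("monday, february 5 2018")
```

Calling `click_calendar_day(driver, "monday, february 5 2018")` would then target the second button in the HTML shown above, provided the calendar has finished rendering.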

javascript selenium xpath web-scraping python-3.x

Pet*_*tit

lucky-day

1
vote

1
answer

10k
views

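Can I submit a form using requests.post?

I am trying to get the list of stores from this site: http://www.health.state.mn.us/divs/cfh/wic/wicstores/

I want the list of stores that is generated when you click the "View All Stores" button. I know I could do this with Selenium or MechanicalSoup or ..., but I would like to use requests.

It looks like clicking the button submits a form: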

 <form name="setAllStores" id="setAllStores" action="/divs/cfh/wic/wicstores/index.cfm" method="post" onsubmit="return _CF_checksetAllStores(this)">
<input name="submitAllStores" id="submitAllStores"  type="submit" value="View All Stores" />

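But I can't figure out how to write the requests query (or whether it is even possible).

What I have tried so far are variations on the following: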

SITE = 'http://www.health.state.mn.us/divs/cfh/wic/wicstores/'
data = {'name': 'setAllStores', 'form': 'submitAllStores', 'input': 'submitAllStores'}
r = requests.post(SITE, data)

But this doesn't work. Any help or suggestions are welcome.
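
A browser submitting this form would send the submit input's `name` and `value` as the form field, posted to the form's `action` URL rather than to the page URL. The sketch below builds both from the HTML shown above. Whether the server also expects cookies, hidden fields, or the JavaScript check in `onsubmit` is unknown, so treat this as a starting point under that assumption; `fetch_all_stores` is a hypothetical helper name.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.health.state.mn.us/divs/cfh/wic/wicstores/'
FORM_ACTION = '/divs/cfh/wic/wicstores/index.cfm'  # the form's action attribute

# Resolve the action against the page URL, as a browser would.
action_url = urljoin(PAGE_URL, FORM_ACTION)

# Form fields are keyed by each input's name; the value is its value attribute.
form_data = {'submitAllStores': 'View All Stores'}


def fetch_all_stores():
    """POST the form and return the response body (requires network access)."""
    import requests  # imported lazily so the constants above work without it
    response = requests.post(action_url, data=form_data)
    response.raise_for_status()
    return response.text
```

The dictionary in the original attempt used the attribute names (`name`, `form`, `input`) as keys, which is not what a browser sends.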

python beautifulsoup web-scraping python-requests

Tim*_*tty

2018 02-15

1
vote

1
answer

4202
views

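How to download an MS Word docx file in Python using raw data from an HTTP URL

If you open the following URL in a browser, a docx file is downloaded. I want to automate the download with Python.

https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False

I have tried the following: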

from docx import Document
import requests
import json
from bs4 import BeautifulSoup
dwnurl = 'https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False'
doc = requests.get(dwnurl)

print(doc.content) #printing the document like b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00!\xfb\x16\x01\x16\x02\x00\x00\xec\x0c\x00\x00\x13\x00\xc4\x01[Content_Types].xml \xa2\xc0\

print(doc.raw)  #printing the document like <urllib3.response.HTTPResponse object at 0x063D8BD0>

document = Document(doc.content)
document.save('test.docx')

#on document.save i have facing these issues

Traceback (most recent call last):
  File "scraping_hudoc.py", line 40, in <module>
    document = Document(doc.content)
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\package.py", line …
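
The traceback points at `Document(doc.content)`. python-docx's `Document` accepts a file path or a file-like object, not a raw bytes value. Two possible workarounds are sketched below: write the downloaded bytes straight to disk (no python-docx needed), or wrap them in `io.BytesIO` before passing them to `Document`. The helper names are hypothetical, and the sample bytes stand in for `doc.content`.

```python
import io
import os
import tempfile


def save_download(content, path):
    """Write downloaded bytes to path in binary mode; return the size written."""
    with open(path, "wb") as fh:
        fh.write(content)
    return os.path.getsize(path)


def as_file_like(content):
    """Wrap bytes in a file-like object, e.g. for docx.Document(...)."""
    return io.BytesIO(content)


# Stand-in for doc.content; a real docx starts with the same ZIP signature.
sample = b"PK\x03\x04"
target = os.path.join(tempfile.mkdtemp(), "test.docx")
written = save_download(sample, target)
```

With a real response, `save_download(doc.content, 'test.docx')` would be enough to keep the file, and `Document(as_file_like(doc.content))` would be the route if the document also needs to be edited.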

python web-scraping

Joy*_*son

2018 02-15

1
vote

1
answer

2495
views

Scrapy Cloud spider requests fail with GeneratorExit

I have a multi-level Scrapy spider that works locally but returns GeneratorExit on every request in Scrapy Cloud.

Here are the parse methods:

def parse(self, response):
    results = list(response.css(".list-group li a::attr(href)"))
    for c in results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        yield response.follow(c,
                              callback=self.parse_category,
                              meta=meta,
                              errback=self.errback_httpbin)

def parse_category(self, response):
    category_results = list(response.css(
        ".item a.link-unstyled::attr(href)"))
    category = response.css(".active [itemprop='title']")
    for r in category_results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        meta["category"] = category
        yield response.follow(r, callback=self.parse_item,
                              meta=meta,
                              errback=self.errback_httpbin)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

Here is the traceback:

Traceback (most recent …
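
As a side note on the parse methods above, the per-key copy loop can be expressed as a shallow `dict` copy. The sketch below uses a hypothetical helper, `child_meta`, and plain dictionaries in place of `response.meta`. It also passes the category as a plain string rather than a selector object, which is an assumption about what downstream code needs and is not claimed to be the cause of the GeneratorExit.

```python
def child_meta(parent_meta, **extra):
    """Return a shallow copy of parent_meta with extra keys added."""
    meta = dict(parent_meta)
    meta.update(extra)
    return meta


# Plain dict standing in for response.meta (hypothetical contents).
parent = {"depth": 1}
meta = child_meta(parent, category="Books")
```

Because the copy is shallow, the parent's mapping is left untouched while each follow-up request gets its own dictionary.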

python scrapy web-scraping

Muc*_*ing

2018 04-06

1
vote

1
answer

417
views

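How to find all assets on a web page using puppeteer?

I want to use puppeteer to search a page and return all available assets, including images, PDFs, any embeddable content, etc.

For our purposes, let's stick to images. An img tag has a src attribute, but what about images loaded through CSS rules? Is there a way to see all of the assets that were loaded?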

javascript node.js web-scraping express puppeteer

dsp*_*099

2018 04-07

1
vote

1
answer

982
views

Incomplete HTML content using Python requests.get

I am trying to fetch HTML content from a URL using requests.get in Python, but I get an incomplete response.

import requests
from lxml import html


url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
headers = {
    # Adjacent string literals are concatenated; the original line break sat inside the quotes.
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'),
    'Content-Type': 'text/html',
    }

response = requests.get(url, headers=headers)
print response.content

Can anyone suggest what to change to get the complete response?

Note: I can get the complete response using Selenium, but that is not the preferred approach.

python web-scraping

Suh*_*een

2018 04-18

1
vote

1
answer

2328
views

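Excel VBA - change the value of an input field

I am relatively new to Excel VBA and am trying to achieve the following:

I know there are many similar questions online, but I have tried to find a solution and failed. I want to log in to a website using VBA, so I need to enter an email address and a password. However, if I change the value of the fields, the website somehow still waits for text input. Am I doing something wrong?

Here is the HTML of the login fields: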

<div class="login">
    <div class="top">
        <a class="sprd-link" ng-href=""><svg xmlns="http://www.w3.org/2000/svg" class="icon" viewBox="0 0 32 32" key="sprd-heart">
    <!----><!----><!---->
    <!----><!---->
<path d="M 21.1 3.8 L 16 9 l -5.1 -5.1 l -9.6 9.6 L 16 28.2 l 14.8 -14.7 l -9.7 -9.7 Z M 16 23.7 L 5.7 13.4 l 5.1 -5.1 l 5.2 5.2 l 5 -5.1 l 5.1 5.1 L 16 23.7 Z" /></svg></a>
    </div>
    <div class="login-container">
        <div class="left">
            <div>
                <h1 class="text-center">Log in …

html excel vba web-scraping

Web*_*tel

2020 06-20

1
vote

1
answer

1968
views