I want to get all the links, along with the start_time and end_time, from one page, and then send each link to a function (parse_detail) to scrape the rest of the information, but I don't know how to write the for loop with Selenium.
Here is my code, and it raises an error:
for site in sites:
exceptions.TypeError: 'WebElement' object is not iterable
Please show me how to loop over elements with Selenium inside a Scrapy spider. Thanks!
class ProductSpider(Spider):
    name = "city20140808"
    start_urls = ['http://wwwtt.tw/11']

    def __init__(self):
        self.driver = webdriver.Firefox()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        self.driver.get(response.url)
        item = CitytalkItem()
        # find_element_by_css_selector returns a single WebElement,
        # which is what raises the TypeError above when iterated
        sites = self.driver.find_element_by_css_selector("div.body ")
        for site in sites:
            linkiwant = site.find_element_by_css_selector(".heading a")
            start = site.find_element_by_css_selector("div.content p.m span.date")
            end = site.find_element_by_css_selector("div.content p.m span.date")
            item['link'] = linkiwant.get_attribute("href")
            item['start_date'] = start.text
            item['end_date'] = end.text
            yield Request(url=item['link'], meta={'items':items}, callback=self.parse_detail)

    def parse_detail(self,response):
        item = response.meta['items']
        ........
        yield item
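A sketch of a fix, reusing the selectors from the question: find_elements_by_css_selector (plural) returns a list of WebElements, which is iterable, and a fresh item is built on every pass so rows don't overwrite each other (note that meta={'items':items} in the original also referenced an undefined name):

def parse(self, response):
    self.driver.get(response.url)
    # plural: returns a list, one WebElement per matching div
    sites = self.driver.find_elements_by_css_selector("div.body")
    for site in sites:
        item = CitytalkItem()  # fresh item per row
        linkiwant = site.find_element_by_css_selector(".heading a")
        start = site.find_element_by_css_selector("div.content p.m span.date")
        end = site.find_element_by_css_selector("div.content p.m span.date")
        item['link'] = linkiwant.get_attribute("href")
        item['start_date'] = start.text
        item['end_date'] = end.text
        # pass the populated item itself to the detail callback
        yield Request(url=item['link'], meta={'item': item}, callback=self.parse_detail)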
How can I pass parameters to a Request in the URL, like this:
site.com/search/?action=search&description=My Search here&e_author=
How can I put the parameters in the structure of the spider's Request, as in this example:
req = Request(url="site.com/",parameters={x=1,y=2,z=3})
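Request has no parameters argument; the query string has to be part of the URL itself. A minimal sketch using the standard library to build it (shown for Python 3; on Python 2 the import is from urllib import urlencode):

from urllib.parse import urlencode
from scrapy import Request

params = {
    'action': 'search',
    'description': 'My Search here',
    'e_author': '',
}
# urlencode percent-escapes the spaces and special characters
url = 'http://site.com/search/?' + urlencode(params)
req = Request(url=url)

Alternatively, FormRequest(url, method='GET', formdata=params) builds the same URL-encoded query string for you.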
I am trying to scrape using the Scrapy framework. Some requests get redirected, but the callback function set in start_requests is not called for those redirected URL requests; it works fine for the non-redirected ones.
I have the following code in my start_requests function:
for user in users:
    yield scrapy.Request(url=userBaseUrl+str(user['userId']),cookies=cookies,headers=headers,dont_filter=True,callback=self.parse_p)
But self.parse_p is only called for the non-302 requests.
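By default RedirectMiddleware follows the 302 and only the final response reaches your callback; if that follow-up request gets dropped (as a duplicate, offsite, or by robots.txt), the callback never fires. One way to take control, sketched here, is to hand the raw redirect response to the callback yourself via the request meta:

for user in users:
    yield scrapy.Request(
        url=userBaseUrl + str(user['userId']),
        cookies=cookies,
        headers=headers,
        dont_filter=True,
        # skip the redirect middleware for this request and treat
        # 301/302 as normal responses instead of letting them be followed
        meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        callback=self.parse_p,
    )

Inside parse_p you can then read response.headers['Location'] and decide whether to follow it.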
I'll start with the Scrapy code I'm trying to use to iterate over a series of vehicles and extract the model and price:
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath("//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath("//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I had expected to be able to loop through the listings and return only the price of the individual vehicle being parsed, but instead it adds an array of all the prices on the page to every vehicle item.
I think the problem is in my XPath selector: does "//div[contains(@class,'price')]/text()" somehow allow the parser to look at divs outside the single vehicle that should be parsed each time?
For reference, if I do listings[1] it returns only one listing, so the loop should be working.
Edit: I added a line above that does print vehicle.extract(), and confirmed that vehicle is definitely only a single item (and it changes on each loop iteration). How can an XPath selector applied to vehicle escape the vehicle object and return all the prices?
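It can: an XPath expression that starts with // always searches from the document root, no matter which selector it is called on. Prefixing it with a dot (.//) anchors the search to the current node. A sketch of the loop with that one change (extract_first() is optional, but returns a single string instead of a list):

for vehicle in listings:
    item = Vehicle()
    item['make'] = split_url[5]
    # the leading dot keeps the query inside this vehicle's subtree
    item['price'] = vehicle.xpath(".//div[contains(@class,'price')]/text()").extract_first()
    item['description'] = vehicle.xpath(".//div[contains(@class,'title-module')]/h2/a/text()").extract_first()
    yield item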
I have just started using Scrapy, and I'm having a few problems logging in with it. I am trying to scrape items from the website www.instacart.com, but I'm running into trouble with the login.
Below is the code:
import scrapy
from scrapy.loader import ItemLoader
from project.items import ProjectItem
from scrapy.http import Request
from scrapy import optional_features
optional_features.remove('boto')

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["https://instacart.com"]
    start_urls = [
        "https://www.instacart.com"
    ]

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.FormRequest('https://www.instacart.com/#login',
                                  formdata={'email': 'xxx@xxx.com', 'password': 'xxxxxx',
                                            },
                                  callback=self.parse)

    def check_login_response(self, response):
        return scrapy.Request('https://www.instacart.com/', self.parse)

    def parse(self, response):
        if "Goutam" in response.body:
            print "Successfully logged in. Let's start crawling!"
        else:
            print "Login unsuccessful"
Below is the error message:
C:\Users\gouta\PycharmProjects\CSG_Scraping\csg_wholefoods>scrapy crawl first
2016-06-15 10:44:50 …
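One likely cause: init_request() and check_login_response() are hooks of scrapy.contrib.spiders.InitSpider, not of the plain scrapy.Spider subclassed here, so login() is never actually scheduled; allowed_domains should also hold bare domain names rather than URLs. A sketch of a more conventional login flow using FormRequest.from_response, which reuses the page's own form including hidden fields such as CSRF tokens (the email/password field names are taken from the question and may not match the site's real form):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["instacart.com"]  # bare domain, no scheme
    start_urls = ["https://www.instacart.com/"]

    def parse(self, response):
        # submit the login form found on the landing page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'xxx@xxx.com', 'password': 'xxxxxx'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if "Goutam" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
        else:
            self.log("Login unsuccessful")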
I have a problem with quoting in Scrapy's CSV output. I want to include commas, which causes some columns in the scraped data to be wrapped in double quotes, like this:
TEST,TEST,TEST,ON,TEST,TEST,"$2,449,000, 4,735 Sq Ft, 6 Bed, 5.1 Bath, Listed 03/01/2016"
TEST,TEST,TEST,ON,TEST,TEST,"$2,895,000, 4,975 Sq Ft, 5 Bed, 4.1 Bath, Listed 01/03/2016"
Only the columns containing commas get wrapped in double quotes. How can I double-quote all of my data columns?
I want Scrapy to output:
"TEST","TEST","TEST","ON","TEST","TEST","$2,449,000, 4,735 Sq Ft, 6 Bed, 5.1 Bath, Listed 03/01/2016"
"TEST","TEST","TEST","ON","TEST","TEST","$2,895,000, 4,975 Sq Ft, 5 Bed, 4.1 Bath, Listed 01/03/2016"
Is there a setting I can change?
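There is no stock setting for this, but CsvItemExporter forwards its extra keyword arguments to Python's csv.writer, so a small subclass can force QUOTE_ALL. A sketch, where myproject is a placeholder for your actual project module:

# myproject/exporters.py
import csv
from scrapy.exporters import CsvItemExporter

class QuoteAllCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # unrecognized kwargs are passed straight through to csv.writer
        kwargs['quoting'] = csv.QUOTE_ALL
        super(QuoteAllCsvItemExporter, self).__init__(*args, **kwargs)

# settings.py -- point the csv feed format at the subclass
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.QuoteAllCsvItemExporter',
}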
I am creating a crawler that takes user input and crawls all the links on a site. However, I need to limit the crawling and link extraction to links from that domain only, not external domains. As far as the crawler itself goes, I've got it to where I need it. My problem is that for my allowed_domains, I can't seem to pass in the scrapy option set through the command line. Below is the first script that runs:
# First Script
import os

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider -a user_input='http://" + user_input + "' crawler_prod.py")

userInput()
The script it runs is the crawler, which will crawl the given domain. Below is the crawler code:
#Crawler
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import Request
from scrapy.http import Request

class InputSpider(CrawlSpider):
    name = "Input"
    #allowed_domains = ["example.com"]

    def allowed_domains(self):
        self.allowed_domains = user_input

    def start_requests(self): …
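Arguments passed with -a arrive as keyword arguments to the spider's __init__, not as a method you define, and allowed_domains has to be populated before crawling starts. A sketch of the idea (the old scrapy.contrib import path matches the question's Scrapy version):

from scrapy.contrib.spiders import CrawlSpider

class InputSpider(CrawlSpider):
    name = "Input"

    def __init__(self, user_input=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # "-a user_input='http://example.com'" lands here as a keyword argument
        domain = user_input.replace('http://', '').strip('/')
        self.allowed_domains = [domain]
        self.start_urls = ['http://%s/' % domain]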
I run Scrapy (version 1.4.0) from a script using CrawlerProcess. The URL comes from user input. The first run works fine, but the second time a twisted.internet.error.ReactorNotRestartable error appears, so the program gets stuck there.
The CrawlerProcess part:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(GeneralSpider)
print('~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~')
process.start()
print('~~~~~~~~~~~~ Processing ended ~~~~~~~~~~')
process.stop()
This is the output of the first run:
~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
2017-07-17 05:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.some-url.com/content.php> (referer: None)
2017-07-17 05:58:46 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse' in <GET http://www.some-url.com/content.php>
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-17 05:58:46 …
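Twisted's reactor cannot be restarted within one process, so calling process.start() a second time is exactly what raises ReactorNotRestartable. One common workaround, sketched below, is to give every crawl its own child process so each one gets a fresh reactor (this assumes GeneralSpider is importable and Python 3):

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

def run_spider():
    # GeneralSpider is assumed to be defined or imported elsewhere
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(GeneralSpider)
    process.start()  # the reactor starts and stops inside this child process

if __name__ == '__main__':
    while input('Crawl again? (y/n): ') == 'y':
        # each crawl gets a fresh process, hence a fresh reactor
        p = Process(target=run_spider)
        p.start()
        p.join()

Separately, the ERROR line in the log ("Spider must return Request, BaseItem, dict or None, got 'HtmlResponse'") means a callback is yielding the response object itself instead of items or requests.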
I am new to Scrapy, and I am really lost on how to return multiple items in one block. Basically, I'm getting one HTML tag holding a quote, with nested tags containing its text, the author name, and some tags about that quote.
The code here returns just one quote and nothing more; it doesn't use the loop to return the rest. I've been searching the web for hours with no luck. Here is my code so far:
Spider
import scrapy
from scrapy.loader import ItemLoader
from first_spider.items import FirstSpiderItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        l = ItemLoader(item = FirstSpiderItem(), response=response)
        quotes = response.xpath("//*[@class='quote']")
        for quote in quotes:
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()
            # removes quotation marks from the text
            for c in ['“', '”']:
                if c in text:
                    text = text.replace(c, "")
            l.add_value('text', text)
            l.add_value('author', author)
            l.add_value('tags', tags)
        # a single loader and a single return can only ever produce one item
        return l.load_item()

        next_page_path = …
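A sketch of the loop reworked so each quote gets its own ItemLoader (the selector argument scopes the loader to that quote's subtree) and is yielded instead of returned, so the generator keeps producing items:

def parse(self, response):
    for quote in response.xpath("//*[@class='quote']"):
        # one loader per quote, scoped to this quote's node
        l = ItemLoader(item=FirstSpiderItem(), selector=quote)
        l.add_xpath('text', ".//span[@class='text']/text()")
        l.add_xpath('author', ".//small[@class='author']/text()")
        l.add_xpath('tags', ".//meta[@class='keywords']/@content")
        # yield, don't return: one item per quote
        yield l.load_item()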
I use scrapy shell from the command line, running scrapy shell "abcwebsitexyz.com" to test some of my code and its values, but I want to pass form data with it, like I tried below:
scrapy shell "abcwebsitexyz.com", formdata={'username': 'user_name','password':'password',}
But it doesn't work.
Please help.
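The shell's command line only accepts a URL, but once inside the shell you can build any request yourself and hand it to fetch(). A sketch (the /login path is a hypothetical placeholder; use the site's real form action):

# start with: scrapy shell
# then, inside the shell:
from scrapy import FormRequest

req = FormRequest(
    'http://abcwebsitexyz.com/login',  # hypothetical login endpoint
    formdata={'username': 'user_name', 'password': 'password'},
)
fetch(req)  # fetch() accepts a Request object; `response` then holds the POST result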