I'm trying to build a simple web scraper. I want to extract the text without the HTML markup. I've actually achieved that, but I've noticed that on some pages that load JavaScript I don't get good results.
For example, if some JavaScript code adds text, I can't see it, because when I call
response = urllib2.urlopen(request)
I don't get the added text in the response (because the JavaScript is executed on the client side).
So I'm looking for ideas to solve this problem.
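One common way around this (a minimal sketch, not part of the original question; it assumes a browser driver such as chromedriver is installed, and the URL is a placeholder) is to let a real browser execute the JavaScript and read the resulting DOM, for example with Selenium:

from selenium import webdriver

driver = webdriver.Chrome()            # assumes chromedriver is on the PATH
driver.get("http://www.example.com")   # placeholder URL
html = driver.page_source              # HTML after client-side JavaScript has run
driver.quit()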
Using the latest version of the Facebook API (2.12), I'm trying to get the (public) events of a page with the Graph API Explorer.
However, I can't seem to get it to work:
When I hover over the greyed-out "id" or "name" on the left, it says "field is empty or not allowed by the access token".
Now, the page I'm using here is TechCrunch, and they have plenty of upcoming events, so "empty" doesn't seem to be the problem.
As for "not allowed", I checked the API reference at https://developers.facebook.com/docs/graph-api/reference/page/events/.
But I can't seem to find any problem there either. It says "Reading events of a page requires a valid page access token or a user access token with basic permissions."
What am I missing here? Any hints are greatly appreciated!
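For reference, a minimal sketch of the same request made outside the Explorer (the page id and token below are placeholders, and it assumes you hold a valid Page access token, as the docs require):

import requests

PAGE_ID = "PAGE_ID_PLACEHOLDER"        # hypothetical: the page's numeric id
ACCESS_TOKEN = "PAGE_ACCESS_TOKEN"     # hypothetical: a valid page access token

resp = requests.get(
    "https://graph.facebook.com/v2.12/{}/events".format(PAGE_ID),
    params={"access_token": ACCESS_TOKEN},
)
# If the token is allowed to read events, they are returned under the "data" key.
print(resp.json())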
I'm looking into Scrapy to crawl pages, but I can't handle pages that use JavaScript. People have suggested I use HtmlUnit, so I installed it, but I have no idea how to use it. Can anyone give me an example (Scrapy + HtmlUnit)? Thanks a lot.
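For what it's worth, HtmlUnit is a Java browser, so from Python it is usually driven indirectly; a hypothetical sketch (assuming a standalone Selenium server is running locally and an older Selenium binding that still ships DesiredCapabilities.HTMLUNITWITHJS) whose output could then be fed into a Scrapy selector:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Connect to a locally running Selenium server that wraps HtmlUnit.
driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    desired_capabilities=DesiredCapabilities.HTMLUNITWITHJS,
)
driver.get("http://www.example.com")   # placeholder URL
html = driver.page_source              # rendered HTML, usable with Selector(text=html)
driver.quit()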
I've been struggling with this for several days now. I've tried every way I could think of, but still with no result; I'm doing something wrong, yet I can't figure out what it is. Many thanks in advance to everyone willing to join this adventure. First things first: I'm trying to POST information to a form on delta.com. As with similar sites, it's complicated because they rely on sessions, cookies and JavaScript, so the problem may lie there. I'm using a code sample I found on Stack Overflow ("Using MultipartPostHandler to POST form-data with Python"). Here is the code I adapted for the Delta page:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from delta.items import DeltaItem
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "delta"
    allowed_domains = ["http://www.delta.com"]
    start_urls = ["http://www.delta.com"]

    def start_requests(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        url="http://www.delta.com/booking/findFlights.do",
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013',
                                                  'paxCount': '1'},
                                        callback=self.parse1)

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//')
        items = []
        for site in sites:
            item = DeltaItem()
            item['title'] = site.select('text()').extract()
            item['link'] = site.select('text()').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
When I tell the spider to crawl from the terminal, I see:
scrapy crawl delta -o items.xml -t xml …
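As an aside, in stock Scrapy start_requests() takes no response argument, so FormRequest.from_response() has nothing to work from at that point; a hypothetical restructuring (the field names are copied from the snippet above) would fetch the start URL first and submit the form from its callback:

import scrapy
from scrapy.http import FormRequest

class DeltaSpider(scrapy.Spider):
    name = "delta"
    allowed_domains = ["delta.com"]        # no scheme in allowed_domains
    start_urls = ["http://www.delta.com"]

    def parse(self, response):
        # Submit the flight search form found on the landing page.
        yield FormRequest.from_response(
            response,
            formname='flightSearchForm',
            formdata={'departureCity[0]': 'JFK',
                      'destinationCity[0]': 'SFO',
                      'departureDate[0]': '07.20.2013',
                      'departureDate[1]': '07.28.2013',
                      'paxCount': '1'},
            callback=self.parse_results)

    def parse_results(self, response):
        # Inspect what the POST actually returned before writing extraction XPaths.
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)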
I am using Scrapy. The website I'm scraping has infinite scroll.
The site has lots of posts, but I only scraped 13.
How can I scrape the rest of the posts?
Here is my code:
class exampleSpider(scrapy.Spider):
    name = "example"
    #from_date = datetime.date.today() - datetime.timedelta(6*365/12)
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/somethinghere/"
    ]

    def parse(self, response):
        for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        #scrape contents code here
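A common approach for infinite-scroll pages (a hypothetical sketch, not this site's actual API: the endpoint, parameters and JSON keys below are placeholders you would find by watching the browser's network tab) is to page through the AJAX endpoint that the scroll handler calls:

import json
import scrapy

class InfinitePostsSpider(scrapy.Spider):
    name = "infinite_posts"
    allowed_domains = ["example.com"]
    # Placeholder endpoint; the real one is visible in the browser's network tab.
    api_url = "http://www.example.com/somethinghere/posts?page={page}"
    start_urls = [api_url.format(page=1)]

    def parse(self, response):
        data = json.loads(response.text)
        for post in data.get("posts", []):
            yield {"title": post.get("title"), "link": post.get("url")}
        # Keep requesting pages until the endpoint says there is nothing left.
        if data.get("has_more"):
            next_page = data.get("page", 1) + 1
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 callback=self.parse)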
I'm trying to crawl this website: http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html
I can get all the products on this page, but how do I issue a request for the "View More" link at the bottom of the page?
My code so far:
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class="proHead"]/a', unique=True)),
    Rule(SgmlLinkExtractor(allow=('http://[^/]+/[^/]+/[^/]+/[^/]+$', ), deny=('/about-us/about-us/contact-us', './music.html', ), unique=True), callback='parse_item'),
)
Any help?
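One possible direction (a sketch only: the class name in the XPath below is a guess and has to be checked against the actual "View More" anchor on the page) is to add a rule that follows that link as well:

# Hypothetical extra rule; inspect the real "View More" anchor to get the selector right.
Rule(SgmlLinkExtractor(restrict_xpaths='//a[contains(@class, "viewMore")]', unique=True),
     callback='parse_item',
     follow=True),

If the link is injected by JavaScript rather than present in the raw HTML, the underlying AJAX request would have to be reproduced instead.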
Facebook's new Graph Search lets you search for the current employees of a company using query tokens, e.g. current Google employees.
I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) with Scrapy.
The initial problem was that Facebook only lets Facebook users access that information, so it redirected me to login.php. So before crawling that URL I log in via Scrapy and then request the results page. But even though the HTTP response for this page is 200, it doesn't scrape any data. The code is below:
import sys
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.http import Request

class DmozSpider(BaseSpider):
    name = "test"
    start_urls = ['https://www.facebook.com/login.php'];
    task_urls = [query]

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formname='login_form',
                                          formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        return Request(query, callback=self.page_parse)

    def page_parse(self, response):
        hxs = HtmlXPathSelector(response) …

I'm using a CrawlSpider that recursively follows the link calling the next page, extracted with a LinkExtractor, for example:
rules = (
    Rule(LinkExtractor(
             allow=(),
             restrict_xpaths=('//a[contains(.,"anextpage")]')),
         callback='parse_method',
         follow=True),
)
I've applied this strategy to recursively crawl different websites, and everything works fine as long as there is text inside the html tag, e.g. <a href="somelink">sometext</a>.
I'm now trying to crawl a site that has
<div class="bui-pagination__item bui-pagination__next-arrow">
    <a class="pagenext" href="/url.html" aria-label="Pagina successiva">
        <svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewBox="0 0 128 128">
            <path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z"></path>
        </svg>
    </a>
</div>
as its "next" button instead of simple text; my LinkExtractor rule no longer seems to apply, and the spider stops after the first page.
I've tried looking for the svg element, but that doesn't seem to trigger the extraction:
restrict_xpaths=('//a[contains(.,name()=svg) and contains(@class,"nextpageclass")]'))
Is there something I'm missing?
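Based on the HTML fragment above, one thing to try (a sketch only, using the classes that appear on the anchor and its parent in that fragment) is to select the <a> by its attributes rather than by the text it contains:

rules = (
    Rule(LinkExtractor(
             restrict_xpaths=('//div[contains(@class, "bui-pagination__next-arrow")]'
                              '/a[contains(@class, "pagenext")]',)),
         callback='parse_method',
         follow=True),
)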
I've built a simple scrapy spider running on scrapinghub:
# (Imports are not shown in the excerpt; SplashRequest presumably comes from scrapy-splash.)
import scrapy
from scrapy_splash import SplashRequest

class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

    def parse(self, response):
        urls = response.css('a.offer-details__title-link::attr(href)').extract()
        print(urls)
        for url in urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        multiple_locs_urls = response.css('a.offer-regions__label::attr(href)').extract()
        print(multiple_locs_urls)
        for url in multiple_locs_urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        next_page_url = response.css('li.pagination_element--next > a.pagination_trigger::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url=next_page_url, callback=self.parse)
    def parse_details(self, …

The spider, for reference:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem

class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            #print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            #print url
            yield item
When I run it, the terminal output is as follows:
2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = …

I'm trying to use Scrapy together with Selenium so that I can interact with JavaScript while still having the powerful scraping framework Scrapy provides. I wrote a script that visits http://www.iens.nl, enters "Amsterdam" in the search bar and then successfully clicks the search button. After the search button is clicked, I want Scrapy to retrieve an element from the newly rendered page. Unfortunately, Scrapy doesn't return any values.
This is what my code looks like:
from selenium import webdriver
from scrapy.loader import ItemLoader
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from properties import PropertiesItem
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = ['http://www.iens.nl']

    def __init__(self):
        chrome_path = '/Users/username/Documents/chromedriver'
        self.driver = webdriver.Chrome(chrome_path)

    def parse(self, response):
        self.driver.get(response.url)
        text_box = self.driver.find_element_by_xpath('//*[@id="searchText"]')
        submit_button = self.driver.find_element_by_xpath('//*[@id="button_search"]')
        text_box.send_keys("Amsterdam")
        submit_button.click()
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('description', '//*[@id="results"]/ul/li[1]/div[2]/h3/a/')
        return l.load_item()

process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()
"properties" is another script, which looks like this:
from …
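A likely culprit (a hypothetical variant, not the original author's code) is that the ItemLoader above is built from the original Scrapy response, which never saw what Selenium rendered after the click; one way to sketch a fix is to build the selector from driver.page_source instead, and to end the XPath with text() rather than a trailing slash:

# Hypothetical replacement for BasicSpider.parse(); it also requires
# "from scrapy.selector import Selector" among the imports.
def parse(self, response):
    self.driver.get(response.url)
    self.driver.find_element_by_xpath('//*[@id="searchText"]').send_keys("Amsterdam")
    self.driver.find_element_by_xpath('//*[@id="button_search"]').click()
    # Build a selector from the DOM that Selenium rendered after the click
    # (a WebDriverWait may be needed so the results have time to appear).
    rendered = Selector(text=self.driver.page_source)
    l = ItemLoader(item=PropertiesItem(), selector=rendered)
    l.add_xpath('description', '//*[@id="results"]/ul/li[1]/div[2]/h3/a/text()')
    return l.load_item()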