The latest version of scraperwiki depends on Poppler (or so its GitHub page says). Unfortunately, the instructions only cover how to get it on OS X and Linux, not Windows. A quick Google turned up nothing very promising, so does anyone know how to get Poppler working on Windows for scraperwiki?
I want to parse a downloaded RSS feed with lxml, but I don't know how to deal with a UnicodeDecodeError:
import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()

# Guess the feed's encoding and hand it to the parser explicitly
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)
But I get this error:
tree = etree.parse(response, parser)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, …Run Code Online (Sandbox Code Playgroud) 是scraperwikiPython模块可安装Scraperwiki.com Web界面之外?看起来源可用,但未打包.
I want to scrape the aspx search results pages for the UK Food Standards Agency's food ratings data (e.g. http://ratings.food.gov.uk/QuickSearch.aspx?q=po30) using mechanize/Python on scraperwiki (http://scraperwiki.com/scrapers/food_standards_agency/), but I've run into a problem trying to follow the "Next" link, which has the following form:
<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />
The form handler looks like this:
<form method="post" action="QuickSearch.aspx?q=po30" onsubmit="javascript:return WebForm_OnSubmit();" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_ContentPlaceHolder1_buttonSearch')" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
An HTTP trace of a manual click on the "Next" link shows __EVENTTARGET left empty. Yet every crib I can find from other scrapers manipulates __EVENTTARGET as the way to get to the next page.
In fact, I'm not sure how the page I'm trying to scrape loads the next page at all. Whatever I throw at the scraper, it only ever loads the first results page. (Even being able to change the number of results per page would be useful, but I can't see how to do that either!)
So: any ideas on how to scrape the 2nd through N'th results pages?
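For what it's worth, since the "Next" control is a plain submit input rather than a __doPostBack link, one thing worth trying is letting mechanize submit the ASP.NET form via that button, so the server receives the button's own name/value pair and __EVENTTARGET can legitimately stay empty. A minimal, untested sketch against the URL above:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://ratings.food.gov.uk/QuickSearch.aspx?q=po30')

# ASP.NET posts back through the single server-side form; submitting
# via the Next button's name sends ctl00$ContentPlaceHolder1$uxResults$uxNext
# in the POST body, with hidden fields (__VIEWSTATE etc.) included automatically.
br.select_form(name='aspnetForm')
response = br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext')
print response.read()

Note that mechanize ignores the form's onsubmit JavaScript; ASP.NET pages usually tolerate that for plain submit buttons, but this is an assumption about this particular page rather than a guarantee.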
在URL中包含"alpha"的链接上有许多链接(hrefs),我想从20个不同的页面收集这些链接并粘贴到通用URL的末尾(第二行最后一行).href可以在一个表中找到,该类对于td是mys-elastic mys-left,而a显然是包含href属性的元素.任何帮助都会非常感激,因为我已经在这里工作了大约一个星期.
import scraperwiki
import lxml.html

# The HTML scraper for the 20 pages that list all the exhibitors
for i in range(1, 21):
    url = ('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm'
           '?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults')
    print url
    list_html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(list_html)
    # Both classes are on the same td, so chain them in the selector
    href_elements = root.cssselect('td.mys-elastic.mys-left a')
    for element in href_elements:
        href = element.get('href')   # read the attribute off each element
        print href
        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html
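Two details of the selector logic are easy to get wrong here: cssselect only matches a td carrying both classes when they are chained as td.mys-elastic.mys-left (a space would mean a descendant element), and the href has to be read from each element inside the loop, not from the list itself. If any hrefs come back relative, urlparse.urljoin('http://ahr13.mapyourshow.com', href) from the standard library is a safer way to build the detail-page URL than plain string concatenation.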
Tags: scraperwiki ×5 · python ×4 · lxml ×2 · asp.net ×1 · chardet ×1 · mechanize ×1 · poppler ×1 · python-2.7 ×1 · rss ×1 · web-scraping ×1