I am learning Python and trying to scrape this page, selecting specific values from the dropdown menus. After that, I need to click each item in the results table to retrieve the specific information. I am able to select the items and retrieve the information with the webdriver, but I do not know how to pass the response URL on to the crawlspider.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import TextResponse

driver = webdriver.Firefox()
driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')

## wait for the select button to appear, then click it
more_btn = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()

## select specific values from the dropdowns
driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()

## submit the search and wait for the results to load
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)

## convert html to "nice format"
text_html = driver.page_source.encode('utf-8')
html_str = str(text_html)

## this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
resp_for_scrapy = TextResponse('none', 200, {}, html_str, [], None)
So this is where I am stuck. I am able to run the query with the code above, but how do I pass resp_for_scrapy to the crawlspider? I tried using resp_for_scrapy in place of item, but that did not work.
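What I was trying to do is roughly the following (just a sketch on my part: HtmlResponse is Scrapy's response class for HTML bodies, and parse_profile is only a placeholder for a callback on the spider shown below):

from scrapy.http import HtmlResponse

## build a proper Scrapy response object out of the Selenium-rendered page
resp_for_scrapy = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding='utf-8',
)

## then call a parsing callback on it directly, e.g.
## for item in spider.parse_profile(resp_for_scrapy):
##     ...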
## spider
class ProfileSpider(CrawlSpider):
    name …

I am new to Python, and I am trying to use Scrapy to download and save the PDF files on this site: http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class legco(BaseSpider):
    name = "legco"
    allowed_domains = ["http://www.legco.gov.hk/"]
    start_urls = ["http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard"]

    rules = (
        Rule(SgmlLinkExtractor(allow=r"\.pdf"), callback="save_pdf")
    )

    def parse_listing(self, response):
        hxs = HtmlXPathSelector(response)
        pdf_urls = hxs.select("a/@href").extract()
        for url in pdf_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
Basically I am trying to restrict the search to links containing ".pdf" only, and then select "a/@href".
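For reference, the overall shape I am aiming for is something like this (only a sketch using the same scrapy.contrib API as above; LegcoPdfSpider is a made-up name, and the deny_extensions=[] argument is just my understanding of how to stop the link extractor from skipping .pdf links by default):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class LegcoPdfSpider(CrawlSpider):
    name = "legco_pdf"
    allowed_domains = ["legco.gov.hk"]
    start_urls = ["http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm"]

    ## a single rule: follow only links ending in .pdf and hand each response to save_pdf
    ## (note the trailing comma -- rules has to be a tuple of Rule objects)
    rules = (
        Rule(SgmlLinkExtractor(allow=r"\.pdf$", deny_extensions=[]), callback="save_pdf"),
    )

    def save_pdf(self, response):
        ## save the PDF under the last segment of its URL
        filename = response.url.split("/")[-1]
        with open(filename, "wb") as f:
            f.write(response.body)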
From the output, I see this error:
2015-03-09 11:00:22-0700 [legco] ERROR: Spider error processing <GET http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard>
Can anyone suggest how I can fix my code? Thank you very much!