Posts by Ony*_*Lam

Passing the Selenium response URL to Scrapy

I am learning Python and trying to scrape this page, selecting specific values from the drop-down menus. After that I need to click each item in the result table to retrieve specific information. I can select the items and retrieve the information with the webdriver, but I don't know how to pass the response URL on to a CrawlSpider.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from scrapy.http import TextResponse
import time

driver = webdriver.Firefox()
driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')

## wait for the filter button to become visible, then open it
more_btn = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()

## select specific values from the dropdowns
driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)

## convert html to "nice format"
text_html = driver.page_source.encode('utf-8')
html_str = str(text_html)

## this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
resp_for_scrapy = TextResponse('none', 200, {}, html_str, [], None)

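(Side note: a less fragile way to build that fake response, as a minimal sketch only, is to pass keyword arguments and an explicit encoding. HtmlResponse, a TextResponse subclass, is assumed here so the rendered page can be queried with Scrapy selectors; driver.current_url and driver.page_source come from the Selenium session above.)

from scrapy.http import HtmlResponse

## sketch: wrap the Selenium-rendered page in a Scrapy response object
resp_for_scrapy = HtmlResponse(
    url=driver.current_url,   # keep the real URL instead of 'none'
    body=driver.page_source,
    encoding='utf-8',
)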

So this is where I am stuck. I can run the query with the code above, but how do I pass resp_for_scrapy to the CrawlSpider? I put resp_for_scrapy in place of item, but that did not work. (One possible way to wire the two together is sketched after the spider stub below.)

## spider 
class ProfileSpider(CrawlSpider):
name …
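For reference, one approach that comes up often for this kind of problem is to drive Selenium from inside the spider and wrap the rendered page in an HtmlResponse so ordinary callbacks can parse it. This is a sketch under assumptions, not the original poster's code: it uses a plain scrapy.Spider (Scrapy ≥ 1.4 for response.follow), and the spider name and the 'table.result' selector are placeholders, not taken from the target site.

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver


class ProfileSpider(scrapy.Spider):
    name = 'profile'
    start_urls = ['http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp']

    def parse(self, response):
        ## render the page with Selenium; the dropdown clicks from the snippet above go here
        driver = webdriver.Firefox()
        driver.get(response.url)
        rendered = HtmlResponse(url=driver.current_url,
                                body=driver.page_source,
                                encoding='utf-8')
        driver.quit()
        ## 'table.result a::attr(href)' is a placeholder selector for the result rows
        for href in rendered.css('table.result a::attr(href)'):
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        ## extract the per-profile fields here
        pass

Third-party packages such as scrapy-selenium wrap the same idea in a downloader middleware, but the inline version above is the shortest way to show it.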

python selenium scrapy

Score: 6 · 2 answers · 9981 views

Saving files from a webpage by extension type with Scrapy

I am new to Python and I am trying to use Scrapy to download and save the PDF files on this site: http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class legco(BaseSpider):
    name = "legco"
    allowed_domains = ["http://www.legco.gov.hk/"]
    start_urls = ["http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard"]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"\.pdf"), callback="save_pdf"),
    )

    def parse_listing(self, response):
        ## collect every link on the listing page and request it
        hxs = HtmlXPathSelector(response)
        pdf_urls = hxs.select("a/@href").extract()
        for url in pdf_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        ## write the downloaded body to disk
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)

Basically I am trying to restrict the crawl to ".pdf" links only and then select "a/@href".

In the output I see this error:

2015-03-09 11:00:22-0700 [legco] ERROR: Spider error processing http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard>

Can anyone suggest how to fix my code? Thanks a lot!
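For comparison, here is a minimal sketch of one way this is often done, not the accepted answer: it assumes a current Scrapy where LinkExtractor lives in scrapy.linkextractors (SgmlLinkExtractor and BaseSpider are long deprecated), and it simply names each saved file after the last segment of its URL.

import os
import scrapy
from scrapy.linkextractors import LinkExtractor


class LegcoPdfSpider(scrapy.Spider):
    name = "legco_pdf"
    allowed_domains = ["legco.gov.hk"]
    start_urls = ["http://www.legco.gov.hk/general/chinese/counmtg/yr04-08/mtg_0708.htm#hansard"]

    def parse(self, response):
        ## extract only links whose URL ends in .pdf and follow each of them
        for link in LinkExtractor(allow=r"\.pdf$").extract_links(response):
            yield scrapy.Request(link.url, callback=self.save_pdf)

    def save_pdf(self, response):
        ## save the PDF body under its original file name (hypothetical naming scheme)
        path = os.path.basename(response.url)
        self.logger.info("Saving PDF to %s", path)
        with open(path, "wb") as f:
            f.write(response.body)

Scrapy's built-in FilesPipeline is another common option when many files need to be downloaded and deduplicated.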

python scrapy web-scraping

Score: 2 · 1 answer · 2345 views

Tag statistics

python ×2

scrapy ×2

selenium ×1

web-scraping ×1