Scrapy does not fetch markup on response.css

Question

Scrapy does not fetch markup on response.css

qub*_*its 5 python scrapy web-scraping scrapinghub splash-js-render

I've built a simple scrapy spider running on scrapinghub:

class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

    def parse(self, response):
        urls = response.css('a.offer-details__title-link::attr(href)').extract()

        print(urls)
        for url in urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        multiple_locs_urls = response.css('a.offer-regions__label::attr(href)').extract()
        print(multiple_locs_urls)        
        for url in multiple_locs_urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        next_page_url = response.css('li.pagination_element--next > a.pagination_trigger::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url=next_page_url, callback=self.parse)

    def parse_details(self, response): 
        yield {
        'title': response.css('#jobTitle').extract_first(),
        'content': response.css('#description').extract_first(),
        'datePosted': response.css('span[itemprop="datePosted"]').extract_first(),
        'address': response.css('span[itemprop="address"]').extract_first()
        }

Run Code Online (Sandbox Code Playgroud)

The problem I am facing is that the multiple_locs_url response.css returns an empty array despite me seeing it in the markup on the browser side.

I checked with scrapy shell and scrapy shell does not see the markup. I guess this is due to the markup being rendered through javascript when the page is loaded.

I added splash but that does not seem to apply to response. How would I make scrapy wait with the query until the page is loaded?

Answer 1

Ruf*_*fat 0

请参阅页面的源代码： view-source:pracuj.pl/praca/polska;ct,1 。html 代码中没有类为“offer-regions__label”的元素。

此代码将始终返回一个空列表：

multiple_locs_urls = response.css('a.offer-regions__label::attr(href)')

Run Code Online (Sandbox Code Playgroud)

但正如这里所解释的/sf/answers/1238813061/：

很多时候，我们在抓取时会遇到这样的问题：页面上呈现的内容是用 Javascript 生成的，因此 scrapy 无法对其进行抓取。

在这种情况下，您可以使用 Selenium。我更改了您的代码并检查了它并且它有效：

class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

    def __init__( self, **kwargs ):

        super().__init__( **kwargs )

        profile = webdriver.FirefoxProfile( "pathToFirefoxProfile" )
        firefox_binary = "pathToFirefoxBinary"  # Must be the developer edition!!!
        # self.driver = webdriver.Firefox()
        self.driver = webdriver.Firefox( profile, firefox_binary = firefox_binary )

    def parse(self, response):

        self.driver.get( response.url )

        elements = self.driver.find_elements_by_css_selector( "a.offer-details__title-link" )
        self.driver.get( response.url )
        for element in elements:
            print( "****" )
            print( str( element.get_attribute( "href" ) ) )
            print( str( element.text ) )

        # your old code below

        urls = response.css('a.offer-details__title-link::attr(href)').extract()

        print(urls)
        for url in urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        multiple_locs_urls = response.css('a.offer-regions__label::attr(href)').extract()
        print(multiple_locs_urls)        
        for url in multiple_locs_urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        next_page_url = response.css('li.pagination_element--next > a.pagination_trigger::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url=next_page_url, callback=self.parse)

    def parse_details(self, response): 
        yield {
        'title': response.css('#jobTitle').extract_first(),
        'content': response.css('#description').extract_first(),
        'datePosted': response.css('span[itemprop="datePosted"]').extract_first(),
        'address': response.css('span[itemprop="address"]').extract_first()
        }

Run Code Online (Sandbox Code Playgroud)

`view-source:...` 仅显示由初始请求下载的 HTML，而不是完全渲染的 DOM。这就是他已经在使用 Splash 的原因 (2认同)

归档时间：	6 年，5 月前
查看次数：	202 次
最近记录：	6 年，5 月前