Getting a 999 response when trying to scrape LinkedIn with Scrapy

Question

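I am trying to extract some information from LinkedIn using the Scrapy framework. I know they are very strict about people trying to scrape their site, so I tried different user agents in settings.py. I also set a high download delay, but it still seems to block me right away.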

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
REDIRECT_ENABLED = False
RETRY_ENABLED = False
DEPTH_LIMIT = 5
DOWNLOAD_TIMEOUT = 10
REACTOR_THREADPOOL_MAXSIZE = 20
CONCURRENT_REQUESTS_PER_DOMAIN = 2
COOKIES_ENABLED = False
HTTPCACHE_ENABLED = True

This is the error I get:

2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider opened
2017-03-20 19:11:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2017-03-20 19:11:29 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
127.0.0.1:6023
2017-03-20 19:11:29 [scrapy.core.engine] DEBUG: Crawled (999) <GET
https://www.linkedin.com/directory/people-1/> (referer: None) ['cached']
2017-03-20 19:11:29 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<999 https://www.linkedin.com/directory/people-1/>: HTTP status code is not handled or 
not allowed
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-20 19:11:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2372,
'downloader/response_count': 1,
'downloader/response_status_count/999': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 503000),
'httpcache/hit': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 378000)}
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider closed (finished)

The spider itself just prints the URLs it visits.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class InfoSpider(CrawlSpider):
    name = "info"
    allowed_domains = ["www.linkedin.com"]
    start_urls = ['https://www.linkedin.com/directory/people-1/']
    rules = [
        Rule(LinkExtractor(
            allow=[r'.*']),
            callback='parse',
            follow=True)
    ]

    def parse(self, response):
        print(response.url)

Answer 1

Rob*_*bot 1

You have to log in to LinkedIn before you can crawl any other page. For logging in with Scrapy, see https://doc.scrapy.org/en/latest/topics/request-response.html#formrequest-objects

Update 1: Here is an example of my code.

from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider
from linkedin.items import *  # item definitions from this project


class LinkedinSpider(InitSpider):
    """
    Define the crawler's start URIs, set its follow rules, parse HTML
    and assign values to an item. Processing occurs in ../pipelines.py
    """

    name = "linkedin"
    allowed_domains = ["linkedin.com"]
    user_name = 'my_user_name'
    passwd = 'my_passwd'

    # Uncomment the following lines for full spidering
    # start_urls = ["http://www.linkedin.com/directory/people-%s-%d-%d-%d"
    #               % (alphanum, num_one, num_two, num_three)
    #                 for alphanum in "abcdefghijklmnopqrstuvwxyz"
    #                 for num_one in range(1, 11)
    #                 for num_two in range(1, 11)
    #                 for num_three in range(1, 11)
    #               ]

    # Temporary start_urls for testing; remove and use the above start_urls in production
    # start_urls = ["http://www.linkedin.com/directory/people-a-23-23-2"]
    start_urls = ["https://www.linkedin.com/in/rebecca-liu-93a12a28/"]
    login_page = 'https://www.linkedin.com/uas/login'
    # TODO: allow /in/name urls too?
    # rules = (
    #     Rule(LinkExtractor(allow=(r'/pub/.+',)),
    #          callback='parse_item'),
    # )

    def init_request(self):
        # Request the login page first; start_urls are crawled only after initialized()
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in and submit the login form found on the login page
        return FormRequest.from_response(
            response,
            formdata={
                'session_key': self.user_name,
                'session_password': self.passwd,
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        # Mark initialization as done so the spider moves on to start_urls
        return self.initialized()

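Note that check_login_response in the code above calls self.initialized() without looking at the response. A minimal sketch of a check that could run first is shown here. The helper name login_looks_successful is hypothetical, and both markers it tests (the /uas/login path and the session_password field, taken from the spider above) are assumptions about what a failed login looks like, not documented LinkedIn behavior.

```python
def login_looks_successful(final_url: str, body: str) -> bool:
    """Heuristic login check (hypothetical helper, assumed markers).

    Treats the login as failed if we are still on the login URL or the
    returned page still contains the password input of the login form.
    """
    still_on_login_page = "/uas/login" in final_url
    login_form_present = 'name="session_password"' in body
    return not (still_on_login_page or login_form_present)
```

Inside the spider, check_login_response could then return self.initialized() only when login_looks_successful(response.url, response.text) is true, and log an error otherwise.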

If you use InitSpider, remember to call self.initialized(); otherwise the parse() method will not be called.
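One more detail from the log in the question: the 999 response is tagged ['cached'], so with HTTPCACHE_ENABLED = True the spider may be re-reading a stored block page instead of contacting LinkedIn again. The settings.py fragment below is a sketch of how that could be avoided, assuming the default cache policy. It does not stop LinkedIn from blocking requests; it only keeps 999 responses out of the cache and lets them reach the spider callbacks for inspection.

```python
# settings.py (sketch)
HTTPCACHE_ENABLED = True

# Do not store responses with these status codes in the HTTP cache,
# so a blocked 999 response is not replayed on later runs.
HTTPCACHE_IGNORE_HTTP_CODES = [999]

# Pass 999 responses through to spider callbacks instead of having
# HttpErrorMiddleware drop them as "not handled or not allowed".
HTTPERROR_ALLOWED_CODES = [999]
```

Entries that were cached before this change are probably still served from disk, so the existing cache directory (by default httpcache under the project's .scrapy data directory) likely needs to be deleted once.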
