I'm trying to scrape data from a website that requires authentication.
I have been able to log in successfully using requests and HttpNtlmAuth with the following:
s = requests.session()
url = "https://website.com/things"
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME','PASSWORD'))
I would like to explore the capabilities of Scrapy, but I haven't been able to authenticate successfully.
I came across the following middleware, which looks like it could work, but I don't think I'm implementing it correctly:
https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py
In my settings.py I have:
SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }
and in my spider class I have:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
I can't get this to work.
If anyone who has successfully scraped a site with NTLM authentication could point me in the right direction, I would appreciate it.
I was able to figure out what was going on.
1: This is treated as a DOWNLOADER_MIDDLEWARE, not a SPIDER_MIDDLEWARE:
DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }
2: The middleware I was trying to use needed significant modification. Here is what works for me:
from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):
    def process_request(self, request, spider):
        url = request.url
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        # Fetch the page with requests + NTLM auth and hand the result back to Scrapy
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        return Response(url, response.status_code, {}, response.content)
In the spider, all you need to do is set these variables:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
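For context, here is a minimal spider sketch showing where those attributes live. The class name, parse callback, and yielded dict are placeholders of my own, not from the original answer; only http_user, http_pass, and the URL come from the question.

import scrapy

class ThingsSpider(scrapy.Spider):
    name = 'things'
    start_urls = ['https://website.com/things']

    # Read by the middleware via getattr(spider, 'http_user'/'http_pass', '')
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASS'

    def parse(self, response):
        # The downloader middleware already performed the NTLM-authenticated GET,
        # so this response body is the authenticated page.
        yield {'url': response.url, 'length': len(response.body)}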
Thanks @SpaceDog for the comment above; I ran into a similar problem while trying to crawl an intranet site with NTLM authentication. The crawler would only see the first page, because the LinkExtractor in the CrawlSpider never fired: CrawlSpider only extracts links from an HtmlResponse, and the middleware above returns a plain Response.
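For reference, this is roughly the behavior involved: CrawlSpider's rule-based link following is skipped for anything that is not an HtmlResponse. The follows_links helper below is purely illustrative (my own name, not a Scrapy API), paraphrasing the check in Scrapy 1.x.

from scrapy.http import Response, HtmlResponse

# Simplified paraphrase of the CrawlSpider check: rules only run on HtmlResponse
def follows_links(response):
    return isinstance(response, HtmlResponse)

print(follows_links(Response('https://intranet.mydomain.ca/')))      # False -> no links followed
print(follows_links(HtmlResponse('https://intranet.mydomain.ca/')))  # True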
Here is my working solution using scrapy 1.0.5:
NTLM_Middleware.py
from scrapy.http import Response, HtmlResponse
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):
    def process_request(self, request, spider):
        url = request.url
        usr = getattr(spider, 'http_usr', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # Return an HtmlResponse (not a plain Response) so that the CrawlSpider's
        # LinkExtractor rules process the page and follow its links.
        return HtmlResponse(url, response.status_code, response.headers.iteritems(), response.content)
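One caveat, not from the original answer: response.headers.iteritems() exists only on Python 2 (which scrapy 1.0.5 targets). On Python 3 the equivalent return line would presumably be:

        return HtmlResponse(url, response.status_code, response.headers.items(), response.content)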
settings.py
import logging
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'scrapy intranet'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS=16
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'intranet.NTLM_Middleware.NTLM_Middleware': 200,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # ITEM_PIPELINES must be a dict of class path -> order, not a bare set
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}
ELASTICSEARCH_SERVER='localhost'
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_USERNAME=''
ELASTICSEARCH_PASSWORD=''
ELASTICSEARCH_INDEX='intranet'
ELASTICSEARCH_TYPE='pages_intranet'
ELASTICSEARCH_UNIQ_KEY='url'
ELASTICSEARCH_LOG_LEVEL=logging.DEBUG
spiders/intranetspider.py
# -*- coding: utf-8 -*-
import scrapy
#from scrapy import log
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import Response
import requests
import sys
from bs4 import BeautifulSoup


class PageItem(scrapy.Item):
    body = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()


class IntranetspiderSpider(CrawlSpider):
    # Credentials read by NTLM_Middleware via getattr()
    http_usr = 'DOMAIN\\user'
    http_pass = 'pass'

    name = "intranetspider"
    protocol = 'https://'
    allowed_domains = ['intranet.mydomain.ca']
    start_urls = ['https://intranet.mydomain.ca/']

    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    def parse_items(self, response):
        self.logger.info('Crawling page %s', response.url)

        item = PageItem()
        soup = BeautifulSoup(response.body)

        # Remove script tags and javascript from the content
        [x.extract() for x in soup.findAll('script')]

        item['body'] = soup.get_text(" ", strip=True)
        item['url'] = response.url

        return item
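One small gap worth noting: PageItem declares a title field that parse_items never fills. A possible one-line addition inside parse_items (my own suggestion, assuming the pages have a <title> tag) would be:

        item['title'] = soup.title.string.strip() if soup.title and soup.title.string else ''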