How do I set Scrapy's user agent when using Splash, in a way equivalent to the following:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
My spider looks like this:
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )
You need to set the user_agent attribute to override the default user agent:
class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'
In this case, UserAgentMiddleware (enabled by default) will override the USER_AGENT setting value with 'Mozilla/5.0'.
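The precedence can be sketched in plain Python. This is a simplified illustration, not the actual UserAgentMiddleware source: a spider-level user_agent attribute, when set, wins over the project-wide USER_AGENT setting.

```python
# Simplified sketch (not Scrapy's actual middleware code) of how the
# effective user agent is chosen: the spider attribute takes precedence,
# falling back to the project-wide setting when the spider sets nothing.
def effective_user_agent(spider_user_agent, settings_user_agent):
    """Return the user agent that would end up on outgoing requests."""
    return spider_user_agent or settings_user_agent
```

So a spider defining user_agent = 'Mozilla/5.0' sends that string regardless of what USER_AGENT says in settings.py.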
You can also override the headers on a per-request basis:
scrapy_splash.SplashRequest(url, headers={'User-Agent': custom_user_agent})
The proper way would be to change the Splash script to include it... still, if it works this way too, there is no need to add it in the spider.
http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
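For completeness, the Splash-script route mentioned above can be sketched as follows. This is an assumption-laden sketch, not code from the answer: it uses Splash's documented splash:set_user_agent Lua function, passed to SplashRequest through the 'execute' endpoint via the lua_source argument.

```python
# Hedged sketch: set the user agent inside the Splash Lua script itself.
# splash:set_user_agent must run before the page is loaded with splash:go.
lua_source = """
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0")
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

# Usage inside a spider (assumes scrapy-splash is installed and configured):
# yield SplashRequest(url, self.parse, endpoint='execute',
#                     args={'lua_source': lua_source})
```

With the 'execute' endpoint, Splash runs the script instead of a plain render, so the user agent is applied by Splash itself rather than by Scrapy's middleware.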