Change IP address dynamically?

Mag*_*n V 47 ip web-crawler dynamic-ip scrapy web-scraping

Consider this scenario: I want to crawl websites frequently, but my IP address got blocked after some day/limit.

So, how can I change my IP address dynamically? Any other ideas?

abe*_*rna 43

The approach with Scrapy is to use two components, a RandomProxy and a RotateUserAgentMiddleware, and to modify DOWNLOADER_MIDDLEWARES as follows.

DOWNLOADER_MIDDLEWARES

You have to insert the new components in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'tutorial.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
    'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware' :400,    
}

Random proxy:
This component processes Scrapy requests using a random proxy from a list, to avoid IP bans and to improve crawling speed.

More details: https://github.com/aivarsk/scrapy-proxies. You can build up a proxy list with a quick internet search. Copy the links into the list.txt file according to the requested URL format.
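As a sketch, the relevant settings for scrapy-proxies (setting names as documented in its README; the file path is a placeholder you must adapt) could look like this in settings.py:

```python
# settings.py additions for scrapy-proxies.
# Retry failed pages so a dead proxy does not lose the request.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# One proxy URL per line, e.g. http://host1:port
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 = pick a random proxy from the list for every request
PROXY_MODE = 0
```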

User-agent rotation

For each Scrapy request, a random user agent is used from a list you define in advance:

import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    # The default user_agent_list contains Chrome, IE, Firefox, Mozilla, Opera and Netscape strings.
    # For more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

More details: https://gist.github.com/seagatesoft/e7de4e3878035726731d

  • How do you get the proxy list? Anyone? Help. (2 upvotes)

小智 9

You can try using proxy servers to avoid being blocked. There are services that provide working proxies. The best one I have tried is https://gimmeproxy.com - they frequently check their proxies against various parameters.

To get a proxy from them, you just need to make the following request:

https://gimmeproxy.com/api/getProxy

They will give a JSON response with all the proxy data, which you can then use as needed:

{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "179.162.22.82",
  "port": "36915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://179.162.22.82:36915",
  "ipPort": "179.162.22.82:36915",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}
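For example, a small Python helper could turn such a response into a proxies mapping for the third-party `requests` library (a minimal sketch; the `curl` field name follows the JSON shown above, and SOCKS proxies require `requests[socks]` to be installed):

```python
def gimmeproxy_to_proxies(data):
    """Build a requests-style proxies mapping from a GimmeProxy JSON response.

    The 'curl' field already holds protocol://ip:port, ready to use.
    """
    return {"http": data["curl"], "https": data["curl"]}


def fetch_via_fresh_proxy(url):
    """Ask GimmeProxy for one checked proxy, then fetch url through it."""
    import requests  # third-party; SOCKS support needs requests[socks]
    data = requests.get("https://gimmeproxy.com/api/getProxy", timeout=30).json()
    return requests.get(url, proxies=gimmeproxy_to_proxies(data), timeout=30)
```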

You can use it with curl like this:

curl -x socks5://179.162.22.82:36915 http://example.com


小智 7

If you use R, you can crawl through TOR. I think TOR resets its IP address automatically every 10 minutes(?). I believe there is a way to force TOR to change the IP at shorter intervals, but that didn't work for me. Instead, you can set up multiple instances of TOR and then switch between the independent instances (here you can find a good explanation of how to set up multiple TOR instances: https://tor.stackexchange.com/questions/2006/how-to-run-multiple-tor-browsers-with-different-ips)

After that you can do something like the following in R (use the ports of your independent TOR browsers and a list of user agents; every call of the 'getURL' function cycles through your list of ports/user agents):

library(RCurl)

port <- c(...)  # your list of TOR SOCKS ports
proxy <- paste("socks5h://127.0.0.1:", port, sep = "")
ua <- c(...)    # your list of user agent strings

# timeout, verbose, ssl and url are assumed to be defined earlier
opt <- list(proxy = sample(proxy, 1),
            useragent = sample(ua, 1),
            followlocation = TRUE,
            referer = "",
            timeout = timeout,
            verbose = verbose,
            ssl.verifypeer = ssl)

webpage <- getURL(url = url, .opts = opt)
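The same port-cycling idea translates to Python; below is a minimal sketch (assuming hypothetical Tor SOCKS listeners on the ports you configured, and using only the standard library to build per-request options):

```python
import itertools
import random

# Hypothetical ports where your independent Tor instances listen (SOCKS5).
TOR_PORTS = [9050, 9052, 9054]

# Fill from your own list; two example strings shown.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

port_cycle = itertools.cycle(TOR_PORTS)


def next_request_options():
    """Pick the next Tor port and a random user agent for one request."""
    port = next(port_cycle)
    socks_url = "socks5h://127.0.0.1:{}".format(port)
    return {
        "proxies": {"http": socks_url, "https": socks_url},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
    }


# Usage with the third-party requests library (SOCKS needs requests[socks]):
# import requests
# resp = requests.get(url, timeout=30, **next_request_options())
```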