Posts by ama*_*ets

Python urllib2 and [Errno 10054] "An existing connection was forcibly closed by the remote host", plus a few urllib2 problems

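I wrote a crawler that uses urllib2 to fetch URLs.

Requests show some strange behavior. I tried analyzing it with Wireshark but could not understand the problem.

getPAGE() is responsible for fetching a URL. If the fetch succeeds it returns the content of the URL (response.read()); otherwise it returns None.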

import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    # Try the request at most twice before giving up
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req)  # fetch the URL
        except HTTPError, e:
            # The server answered, but with an HTTP error status
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            # No usable response (DNS failure, refused connection, ...)
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4) …
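
For comparison, here is a minimal Python 3 sketch of the same retry pattern. It is not the poster's code: the name get_page and the opener and sleep parameters are hypothetical and exist only so the function can be exercised without a network. It rests on the assumption that a connection reset can surface as a plain socket-level error (for example while reading the body) rather than as a URLError, so it also catches OSError, which includes ConnectionResetError, and keeps read() inside the try block.

```python
import time
import urllib.request
from urllib.error import HTTPError, URLError


def get_page(url, attempts=2, delay=4,
             opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body as bytes, or None if every attempt fails.

    `opener` and `sleep` are hypothetical injection points for testing.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    for _ in range(attempts):
        request = urllib.request.Request(url, None, headers)
        try:
            response = opener(request)
            # read() stays inside the try block so that a reset while
            # the body is being received is handled as well.
            return response.read()
        except HTTPError as exc:
            # The server answered with an HTTP error status.
            print("HTTP error", exc.code, "address:", url)
        except URLError as exc:
            # The server could not be reached at all.
            print("Failed to reach the server:", exc.reason, "address:", url)
        except OSError as exc:
            # Socket-level failures such as ConnectionResetError.
            print("Connection problem:", exc, "address:", url)
        sleep(delay)
    return None
```

The except clauses go from most specific to least specific, because HTTPError is a subclass of URLError, which is itself a subclass of OSError.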

python exception urllib2 errno web-crawler

7
votes
0
answers
8104
views

Splash containers stop working after 30 minutes

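I have a problem with Aquarium and Splash: they stop working about 30 minutes after starting. The number of pages loaded is 50K-80K. I set up a cron job that automatically restarts every Splash container every 10 minutes, but it does not help. How can I fix this?

Here is a screenshot: [image]. Statistics from HAProxy: [image]. Here is the Splash configuration: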

splash0:
    image: scrapinghub/splash:3.0
    command: --max-timeout 3600 --slots 150 --maxrss 1000 --verbosity 5
    logging:
      driver: "none"
    expose:
        - 8050
    mem_limit: 1000m
    memswap_limit: 1000m
    restart: always

And the HAProxy configuration:

backend splash-cluster
    option httpchk GET /
    balance leastconn

    # try another instance when connection is dropped
    retries 2
    option redispatch
    server splash-0 splash0:8050 check maxconn 150 inter 2s fall 10 observe layer4

backend splash-misc
    balance roundrobin
    server splash-0 splash0:8050 check fall 15

Update 1: here is the restart script:

#!/bin/sh

echo "BEGIN" >> restart.log
for index …
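
Since container logging is disabled in the configuration above, one way to see what happens before the 30-minute mark is to poll each instance and record what it reports. The following is only a diagnostic sketch, not a fix. It assumes the /_ping endpoint from the Splash HTTP API and that its JSON reply carries "status" and "maxrss" fields. The function name ping_splash and the fetch parameter are hypothetical, with fetch added only so the logic can be tested offline.

```python
import json
import urllib.request


def ping_splash(base_url, fetch=None, timeout=5):
    """Return (status, maxrss) for one Splash instance, or ("down", None).

    The /_ping path and the "status"/"maxrss" keys are assumptions about
    the Splash HTTP API; `fetch` is a hypothetical hook for testing.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
    try:
        payload = json.loads(fetch(base_url.rstrip("/") + "/_ping"))
    except (OSError, ValueError):
        # Unreachable instance, timeout, or a reply that is not JSON.
        return "down", None
    if not isinstance(payload, dict):
        return "unknown", None
    return payload.get("status", "unknown"), payload.get("maxrss")
```

Calling ping_splash("http://splash0:8050") at a fixed interval and appending the result to a log would show whether the reported memory figure keeps climbing toward the --maxrss limit before the instance stops answering.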

haproxy docker splash-js-render

7
votes
0
answers
477
views