I wrote a scraper that uses urllib2 to fetch URLs.
On every request I get some strange behavior; I tried to analyze it with Wireshark but could not understand the problem.
getPAGE() is responsible for fetching a URL: it returns the content of the URL (response.read()) if the fetch succeeds, and None otherwise.
import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req)  # fetching the url
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()  # success: return the page content
    return None  # both attempts failed

I have a problem with Aquarium and Splash: they stop working about 30 minutes after starting, by which point 50K-80K pages have been loaded. I set up a cron job that automatically restarts each Splash container every 10 minutes, but it doesn't help. How can I fix this?
Here is a screenshot of the stats from HAProxy.
Here is the Splash config:

splash0:
  image: scrapinghub/splash:3.0
  command: --max-timeout 3600 --slots 150 --maxrss 1000 --verbosity 5
  logging:
    driver: "none"
  expose:
    - 8050
  mem_limit: 1000m
  memswap_limit: 1000m
  restart: always
And the HAProxy config:
backend splash-cluster
    option httpchk GET /
    balance leastconn
    # try another instance when connection is dropped
    retries 2
    option redispatch
    server splash-0 splash0:8050 check maxconn 150 inter 2s fall 10 observe layer4

backend splash-misc
    balance roundrobin
    server splash-0 splash0:8050 check fall 15
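The "option httpchk GET /" line makes HAProxy poll each Splash instance over HTTP and drop it from rotation once enough checks fail. The same kind of probe can be run by hand to see whether an instance is actually dead or just slow. A minimal sketch, assuming a Splash instance reachable on localhost:8050 and using its _ping endpoint; the host, port, and timeout here are assumptions, not values from this setup:

import socket
import urllib2

def splash_is_alive(host="localhost", port=8050):
    # Splash answers on /_ping with HTTP 200 while it can still
    # serve requests; any error or timeout counts as dead.
    try:
        response = urllib2.urlopen("http://%s:%d/_ping" % (host, port), timeout=5)
        return response.getcode() == 200
    except (urllib2.URLError, socket.timeout):
        return False

print splash_is_alive()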
Update 1: here is the restart script:
#!/bin/sh
echo "BEGIN" >> restart.log
for index …
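The script above is cut off. For illustration only, a minimal sketch of a restart loop matching the description (restart each Splash container in turn); the container names splash0, splash1, splash2 and the use of docker restart are assumptions, not taken from the original script:

import subprocess

# Container names are an assumption, based on the splash0 service in
# the compose file above; adjust them to the real compose project names.
SPLASH_CONTAINERS = ["splash0", "splash1", "splash2"]

for name in SPLASH_CONTAINERS:
    print "restarting " + name
    # "docker restart" stops and starts the container; with
    # "restart: always" in the compose file it comes back up
    # with the same configuration.
    subprocess.call(["docker", "restart", name])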