var*_*rin · tags: python, mechanize, web-crawler, scraper
I'm using Mechanize and Beautiful Soup to scrape some data from Delicious:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.delicious.com/varunsrin"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
print soup.prettify()
This works on most sites I throw it at, but fails on Delicious with the following output:
Traceback (most recent call last):
  File "C:\Users\Varun\Desktop\Python-3.py", line 7, in <module>
    page = mech.open(url)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 255, in _mech_open
    raise response
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:1360: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  child = getattr(self.value, childStr)
C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:456: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  return apply(func, args)
Picked up some tips on emulating a browser with Python + mechanize from here. By default, mechanize fetches and obeys the target site's robots.txt, which is what raises the 403; disabling that check with set_handle_robots(False) and setting a browser-like User-Agent via addheaders seem to be the minimum required. With the code below, I get the expected output:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

br = Browser()
br.set_handle_robots(False)   # do not fetch or obey robots.txt
# present a browser-like User-Agent instead of mechanize's default
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/varunsrin"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
print soup.prettify()
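Note that the code above is Python 2 era (mechanize and the BeautifulSoup 3 import no longer apply to Python 3). A minimal sketch of the same approach with the modern requests and bs4 packages might look like the following; the URL and User-Agent string are carried over from the answer, and requests never consults robots.txt, so no equivalent of set_handle_robots(False) is needed:

import requests
from bs4 import BeautifulSoup

url = "http://www.delicious.com/varunsrin"
headers = {
    # same browser-style User-Agent as in the mechanize version above
    "User-Agent": ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) "
                   "Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1"),
}

response = requests.get(url, headers=headers)
response.raise_for_status()  # surface 4xx/5xx errors immediately

soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())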