sam*_*sam 4 python recursion web-crawler
For both academic and performance reasons, what is the best way to make this recursive web-crawling function (which crawls only within the given domain) run iteratively? As it stands, by the time it finishes running, Python's memory usage has climbed past 1 GB, which is not acceptable in a shared hosting environment.
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 API (findAll)

def crawl(self, url):
    """Get all URLs from which to scrape categories."""
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])  # recursion is what grows the stack and the memory footprint
Meh*_*ari 12
Use BFS instead of recursive crawling (DFS): http://en.wikipedia.org/wiki/Breadth_first_search
You can back the BFS queue with an external storage solution (e.g., a database) to free up RAM (a sketch of that follows the pseudocode below).
The algorithm is:
// pseudocode:
var urlsToVisit = new Queue(); // frontier: a queue gives BFS, a stack would give DFS
                               // (probably with a database backing or something)
var visitedUrls = new Set();   // set of URLs already seen (queued or visited)

// initialization:
urlsToVisit.Add(rootUrl);
visitedUrls.Add(rootUrl);

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    var links = ParseLinks(page);
    foreach (var link in links) {
        if (!visitedUrls.Contains(link)) {
            visitedUrls.Add(link);    // mark at enqueue time so the same URL
            urlsToVisit.Add(link);    // cannot be queued (and fetched) twice
        }
    }
}
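Translated to Python, the loop above might look like the following. This is a minimal sketch that reuses the question's Crawler helpers (_match_tag, _match_attr, _is_category) and its urllib2/BeautifulSoup imports; those, and the method living on the same class, are assumptions about code not shown here:

from collections import deque

def crawl(self, root_url):
    """Iterative BFS crawl: the frontier lives in a deque, not the call stack."""
    to_visit = deque([root_url])   # swap popleft() for pop() to get DFS instead
    visited = set([root_url])      # set membership is O(1); the original list scan is O(n)
    while to_visit:
        url = to_visit.popleft()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr) and not Crawler._is_category(attr):
                    if attr[1] not in visited:
                        visited.add(attr[1])      # mark at enqueue time
                        to_visit.append(attr[1])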
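And since the remaining RAM cost is the frontier itself, here is one hedged way to push it onto disk with the standard-library sqlite3 module, as suggested above. The DiskFrontier name, table schema, and file name are illustrative assumptions, not an established API:

import sqlite3

class DiskFrontier(object):
    """URL frontier kept in SQLite so neither the queue nor the
    visited set has to fit in RAM; done=0 rows are queued, done=1 visited."""

    def __init__(self, path='frontier.db'):
        self.db = sqlite3.connect(path)
        self.db.execute('CREATE TABLE IF NOT EXISTS urls '
                        '(url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)')

    def push(self, url):
        # the PRIMARY KEY makes re-adding a known URL a no-op, which
        # doubles as the "already visited or already queued" check
        self.db.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)', (url,))
        self.db.commit()

    def pop(self):
        row = self.db.execute('SELECT url FROM urls WHERE done = 0 '
                              'ORDER BY rowid LIMIT 1').fetchone()
        if row is None:
            return None                 # frontier exhausted
        self.db.execute('UPDATE urls SET done = 1 WHERE url = ?', (row[0],))
        self.db.commit()
        return row[0]

The crawl loop then becomes frontier.push(root_url) followed by popping until pop() returns None; because duplicates are rejected by the PRIMARY KEY, the one table serves as both the queue and the visited set.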