How can this recursive crawl function be made iterative?

sam*_*sam 4 python recursion web-crawler

For academic and performance reasons, what is the best way to make this recursive web-crawling function (it crawls only within the given domain) run iteratively? As it stands, by the time it finishes, Python's memory use has climbed past 1 GB, which is not acceptable for running in a shared environment.

    def crawl(self, url):
        "Get all URLs from which to scrape categories."
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            return
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr):
                    if Crawler._is_category(attr):
                        pass
                    elif attr[1] not in self._crawled:
                        self._crawled.append(attr[1])
                        # Recursive call: every level keeps its stack frame
                        # (and its parsed page) alive until the whole subtree is done.
                        self.crawl(attr[1])

Meh*_*ari 12

Use BFS instead of recursive crawling (which is DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can back the BFS queue with an external storage solution (such as a database) to free up RAM.

The algorithm is:

// pseudocode:
var urlsToVisit = new Queue(); // A queue gives BFS; a stack gives DFS. (Probably with a database backing or something.)
var visitedUrls = new Set();   // Set of URLs already seen (queued or visited).

// initialization:
urlsToVisit.Add(rootUrl);
visitedUrls.Add(rootUrl);

while (urlsToVisit.Count > 0) {
  var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
  var page = FetchPage(nextUrl);
  ProcessPage(page);
  var links = ParseLinks(page);
  foreach (var link in links) {
    if (!visitedUrls.Contains(link)) {
      visitedUrls.Add(link);   // mark at enqueue time so the same URL is never queued twice
      urlsToVisit.Add(link);
    }
  }
}
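Translated into the question's Python 2 idiom, the loop might look like the sketch below. It assumes the same Crawler helpers the question uses (_match_tag, _match_attr, _is_category); collections.deque serves as the in-memory queue and could be swapped for a database-backed one as suggested above.

    import urllib2
    from collections import deque
    from BeautifulSoup import BeautifulSoup

    def crawl(self, root_url):
        "Iterative BFS version of crawl(): one loop, no recursion, no call-stack growth."
        to_visit = deque([root_url])     # the frontier; popleft() = BFS, pop() would give DFS
        self._crawled = set([root_url])  # a set makes the membership test O(1), unlike a list
        while to_visit:
            url = to_visit.popleft()
            try:
                links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
            except urllib2.HTTPError:
                continue
            for link in links:
                for attr in link.attrs:
                    if (Crawler._match_attr(attr) and not Crawler._is_category(attr)
                            and attr[1] not in self._crawled):
                        self._crawled.add(attr[1])  # mark at enqueue time, as in the pseudocode
                        to_visit.append(attr[1])

Each parsed page is now released as soon as its loop iteration ends, instead of being pinned by a chain of recursive stack frames.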


Ber*_*Ber 5

Instead of recursing, you can push newly found URLs onto a queue and loop until the queue is empty. If you keep the queue in a file, it uses almost no memory.
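A minimal sketch of that idea, again assuming the question's Crawler helpers: the frontier lives in a plain text file (the name crawl_queue.txt is illustrative), with a read offset tracking the next URL to dequeue and new URLs appended at the end. Only the visited set stays in RAM here; it could likewise be moved to a database.

    import os
    import urllib2
    from BeautifulSoup import BeautifulSoup

    def crawl_file_queue(self, root_url, queue_path='crawl_queue.txt'):
        "Iterative crawl with the URL frontier kept on disk instead of in RAM."
        self._crawled = set([root_url])  # only the visited set stays in memory
        queue = open(queue_path, 'w+')   # fresh readable/writable work file
        queue.write(root_url + '\n')
        queue.flush()
        read_pos = 0                     # byte offset of the next URL to dequeue
        while True:
            queue.seek(read_pos)
            line = queue.readline()
            if not line:                 # nothing past the read offset: queue drained
                break
            read_pos = queue.tell()
            try:
                links = BeautifulSoup(urllib2.urlopen(line.strip())).findAll(Crawler._match_tag)
            except urllib2.HTTPError:
                continue
            queue.seek(0, os.SEEK_END)   # move back to the end before appending
            for link in links:
                for attr in link.attrs:
                    if (Crawler._match_attr(attr) and not Crawler._is_category(attr)
                            and attr[1] not in self._crawled):
                        self._crawled.add(attr[1])
                        queue.write(attr[1] + '\n')
            queue.flush()
        queue.close()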