BeautifulSoup: RuntimeError: maximum recursion depth exceeded

alm*_*ann 6 python recursion runtime-error beautifulsoup

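I can't get past the maximum recursion depth Python RuntimeError when using BeautifulSoup.

I'm trying to recurse through nested sections of code and extract their content. The prettified HTML looks like this (don't ask why it looks like this :)):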

<div><code><code><code><code>Code in here</code></code></code></code></div>

The function I pass the soup object to is:

def _strip_descendent_code(self, soup):
    sys.setrecursionlimit(2000)
    # soup = BeautifulSoup(html, 'lxml')
    for code in soup.findAll('code'):
        s = ""
        for c in code.descendents:
            if not isinstance(c, NavigableString):
                if c.name != code.name:
                    continue
                elif c.name == code.name:
                    if isinstance(c, NavigableString):
                        s += str(c)
                    else:
                        continue
        code.append(s)
    return str(soup)

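You can see that I'm trying to raise the default recursion limit, but that is not a solution. I have raised it to the point where C runs into the computer's memory limits, and the function above still never works.

Any help getting this to work, and pointing out the error(s), would be much appreciated.

The stack trace repeats this: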

  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 512, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1548, in __init__
    self.text = self._normalize_search_value(text)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1553, in _normalize_search_value
    if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
RuntimeError: maximum recursion depth exceeded while calling a Python object

小智 10

I ran into this problem and went through a lot of web pages. I have boiled it down to two ways of dealing with it.

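First, though, I think we should understand why it happens. Python limits the recursion depth (the default limit is 1000). We can check the current value with print(sys.getrecursionlimit()). My guess is that BeautifulSoup uses recursion to find child elements. When the recursion goes deeper than the limit, RuntimeError: maximum recursion depth exceeded is raised.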
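To make the limit concrete, here is a minimal sketch that does not involve BeautifulSoup at all. The helper `depth` is a hypothetical function introduced only for illustration; it recurses past whatever limit the interpreter currently reports and records what happens.

```python
import sys

def depth(n):
    """Hypothetical helper: recurse n levels deep and return n."""
    if n == 0:
        return 0
    return 1 + depth(n - 1)

limit = sys.getrecursionlimit()  # usually 1000 unless something changed it

try:
    depth(limit + 100)  # deliberately deeper than the current limit
    outcome = "ok"
except RuntimeError as exc:
    # On Python 3.5+ this is a RecursionError, which subclasses RuntimeError.
    outcome = "failed: " + type(exc).__name__

print(limit, outcome)
```

Catching RuntimeError keeps the sketch compatible with Python 3.4, the version shown in the question's traceback.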

First method: use sys.setrecursionlimit() to raise the recursion limit. You can obviously set it to 1000000, but that may lead to a segmentation fault.
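If you do raise the limit, one cautious option is to raise it only around the call that needs it and restore it afterwards. The sketch below assumes a hypothetical context manager named `recursion_limit` and a toy recursive function `depth`; neither comes from BeautifulSoup or from the question.

```python
import sys
from contextlib import contextmanager

@contextmanager
def recursion_limit(limit):
    """Hypothetical helper: set the recursion limit temporarily, then restore it."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old)

def depth(n):
    """Toy recursive function standing in for a deeply recursive call."""
    return 0 if n == 0 else 1 + depth(n - 1)

before = sys.getrecursionlimit()
with recursion_limit(before + 2000):
    # This depth would exceed the original limit, but fits under the raised one.
    result = depth(before + 500)
after = sys.getrecursionlimit()

print(result, before, after)
```

A modest bump like this stays well away from the very large values that risk a segmentation fault, and the rest of the program keeps the default safety net.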

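Second method: use try-except. If maximum recursion depth exceeded shows up, there is probably something wrong with our algorithm. Generally speaking, we can use a loop instead of recursion. In your case, we could preprocess the HTML with replace() or a regular expression before parsing it.

Finally, here is an example.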
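The preprocessing idea can be sketched with the standard `re` module, using the HTML from the question. The function name `collapse_nested_code` is hypothetical, and the sketch assumes the extra `<code>` tags are pure wrappers, that is, runs of directly nested tags with nothing but whitespace between them. Markup with text between the nested tags would need a real parser.

```python
import re

def collapse_nested_code(html):
    """Collapse runs of directly nested <code> tags into a single pair.

    Assumes the nested tags are pure wrappers (only whitespace between them).
    """
    # "<code><code><code>" becomes "<code>"
    html = re.sub(r'<code>(?:\s*<code>)+', '<code>', html)
    # "</code></code></code>" becomes "</code>"
    html = re.sub(r'</code>(?:\s*</code>)+', '</code>', html)
    return html

html = '<div><code><code><code><code>Code in here</code></code></code></code></div>'
flattened = collapse_nested_code(html)
print(flattened)  # <div><code>Code in here</code></div>
```

The flattened string can then be handed to BeautifulSoup as usual, with far less nesting to walk.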

from bs4 import BeautifulSoup
import sys
# sys.setrecursionlimit(10000)

try:
    # 1000 consecutive <br> tags
    doc = '<br>' * 1000
    soup = BeautifulSoup(doc, 'html.parser')
    a = soup.find('br')
    for i in a:
        print(i)
except RuntimeError:  # RecursionError subclasses RuntimeError on Python 3.5+
    print('failed')

If you remove the #, it can print doc.

Hope this helps.

  • But why would the OP's example code recurse 1000 times? (2 upvotes)

ngo*_*pal 5

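I'm not sure why this works (I haven't checked the source code), but adding .text or .get_text() seems to get around the error for me.

For example, changing

lambda x: BeautifulSoup(x, 'html.parser')

to

lambda x: BeautifulSoup(x, 'html.parser').get_text()

seems to work fine without raising the recursion depth error.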
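The answer above does not say why get_text() helps. If plain text is all that is needed downstream, here is a sketch of a different, swapped-in technique: it is not BeautifulSoup's get_text(), but the standard library's event-based `html.parser.HTMLParser`, which streams through the markup without building a nested tree. The class `TextExtractor` and the function `extract_text` are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes as the parser streams through the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

def extract_text(html):
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

# 5000 levels of nesting, far deeper than the default recursion limit.
deep = '<code>' * 5000 + 'Code in here' + '</code>' * 5000
text = extract_text(deep)
print(text)  # Code in here
```

Because the parser only reports start tags, end tags, and text as events, the nesting depth of the input does not translate into Python call depth.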