Tags: python, memory-leaks, beautifulsoup, python-requests
The basic idea is to issue requests to a list of URLs and parse the text out of each page source, using BeautifulSoup to strip the HTML tags and scripts. Python version is 2.7.

The problem is that the parser function keeps accumulating memory with every request; the process size grows steadily.
from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source), 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # print text
    return text
The memory leaks even when parsing a local text file. For example:
# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 300 MB
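For reference, a small harness along these lines can confirm the per-request growth; psutil, the urls list, and the timeout value are assumptions added for illustration, not part of the original post:

import psutil
import requests

urls = ["https://example.com"] * 3   # placeholder list; substitute the real URLs
timeout = 10                         # assumption: same timeout as in the question

process = psutil.Process()           # handle to the current process

for i, url in enumerate(urls, 1):
    response = requests.get(url, timeout=timeout)
    parsed_string_from_html_source = get_text_from_page_source(response.content)
    rss_mb = process.memory_info().rss / (1024.0 * 1024.0)
    print("request %d: %.1f MB resident" % (i, rss_mb))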
You can try calling soup.decompose() at the end of get_text_from_page_source, just before it returns, to destroy the tree.
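A minimal sketch of that change, keeping the rest of the function as posted and passing the page source in directly, the way the question's request loop calls it (the text has to be extracted before the tree is destroyed):

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the whole tree now that the text is extracted
    return text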
Also, if you open a text file instead of feeding the request content in directly, as in:
soup = BeautifulSoup(open(page_source), 'html.parser')
remember to close the file when you're done. To keep it short, you can change that line to:
with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(), 'html.parser')
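The with block guarantees the file handle is closed as soon as the block exits, even if parsing raises, which matters when the function runs in a loop over many URLs: leaked file handles would otherwise pile up alongside the parse trees.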