Tags: python, memory-leaks, beautifulsoup, python-requests
The basic idea is to issue requests to a list of URLs and parse the text out of each page source, using BeautifulSoup to strip the HTML tags and scripts. Python version is 2.7.

The problem is that the parser function keeps accumulating memory with every request; the process size grows steadily.
from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source), 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # print text
    return text
The memory leaks even when parsing a local text file. For example:
# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 300 MB
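For reference, a small harness along these lines can confirm the per-request growth; psutil, the urls list, and the timeout value are assumptions added for illustration, not part of the original post:

import psutil
import requests

urls = ["https://example.com"] * 3   # placeholder list; substitute the real URLs
timeout = 10                         # assumption: same timeout as in the question

process = psutil.Process()           # handle to the current process

for i, url in enumerate(urls, 1):
    response = requests.get(url, timeout=timeout)
    parsed_string_from_html_source = get_text_from_page_source(response.content)
    rss_mb = process.memory_info().rss / (1024.0 * 1024.0)
    print("request %d: %.1f MB resident" % (i, rss_mb))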
You can try calling soup.decompose() at the end of get_text_from_page_source, just before it returns, to destroy the tree.
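A minimal sketch of that change, keeping the rest of the function as posted and passing the page source in directly, the way the question's request loop calls it (the text has to be extracted before the tree is destroyed):

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the whole tree now that the text is extracted
    return text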
Also, if you open a text file instead of feeding the request content in directly, as in:
soup = BeautifulSoup(open(page_source), 'html.parser')
remember to close the file when you're done. To keep it short, you can change that line to:
with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(), 'html.parser')
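The with block guarantees the file handle is closed as soon as the block exits, even if parsing raises, which matters when the function runs in a loop over many URLs: leaked file handles would otherwise pile up alongside the parse trees.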