Posts by wiz*_*ard

Memory leak when parsing HTML page source with BeautifulSoup and Requests

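So the basic idea is to make GET requests to a list of URLs and extract the text from each page source by using BeautifulSoup to remove the HTML tags and scripts. Python version is 2.7.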

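The problem is that memory usage grows with every request: each call to the parser function adds a little more, and the process size increases gradually.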

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source),'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for …
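
One way to check whether the parser function itself is what retains memory is to count the objects tracked by the garbage collector before and after a batch of calls. The following is only a minimal sketch: `object_growth`, `leaky`, and `clean` are hypothetical names introduced for illustration, and the two small functions stand in for a real parser. It uses only the standard `gc` module. Object counts are a rough proxy, not a byte measurement, and memory allocated inside C extensions will not show up here.

```python
import gc


def object_growth(func, arg, repeats=50):
    """Call func(arg) repeatedly and return the change in the number of
    gc-tracked objects between a warm-up call and the final call."""
    func(arg)                      # warm-up, so one-time caches are not counted
    gc.collect()
    before = len(gc.get_objects())
    for _ in range(repeats):
        func(arg)
    gc.collect()
    after = len(gc.get_objects())
    return after - before


# Hypothetical stand-ins for a real parser function.
_leaked = []


def leaky(text):
    # Keeps a reference to a new list on every call, so objects accumulate.
    _leaked.append(text.split())


def clean(text):
    # Builds a temporary list and lets it go out of scope.
    return len(text.split())


if __name__ == "__main__":
    print("leaky growth:", object_growth(leaky, "a b c"))
    print("clean growth:", object_growth(clean, "a b c"))
```

Passing `get_text_from_page_source` and one saved page as `func` and `arg` would give a comparable number for the real parser. If the count stays roughly flat there, the growth is more likely coming from the calling code, for example from results or responses that are kept in a list.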


python memory-leaks beautifulsoup python-requests

wiz*_*ard

lucky-day

6
votes

1
answer

340
views