Post by hti*_*fcs

Remove all styles, scripts, and HTML tags from an HTML page

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    # Parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, "html.parser")
    # Remove every <script> element together with its contents
    for script in soup(["script"]):
        script.extract()
    # Return the text that remains in the document
    text = soup.get_text()
    return text

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print(cleaned)

So far this works for removing the script.
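Since the title also asks for styles to be removed, one possible extension of the function above is to treat `<style>` the same way as `<script>`. The following is only a sketch: the name `clean_html` is made up for illustration, and the choice of `html.parser`, `decompose()`, and the newline separator are assumptions rather than anything stated in the post.

```python
from bs4 import BeautifulSoup

def clean_html(html):
    """Return the text of an HTML string without script or style content."""
    # html.parser ships with Python, so no extra parser package is needed
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements, including everything inside them
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments, stripping whitespace around each one
    return soup.get_text(separator="\n", strip=True)

testhtml = (
    "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title>"
    "<style>.call {font-family:Arial;}</style><script>getit</script>"
    "<body>I need this text captured<h1>And this</h1></body>"
)

print(clean_html(testhtml))
```

The text of `<title>` is still part of the result here; adding `"title"` to the list of tag names would drop it as well, if that is what is wanted.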

html python beautifulsoup
