KVI*_*ISH 45 python beautifulsoup nltk
我正在尝试使用bs4删除所有的html/javascript,但是,它并没有摆脱javascript.我仍然在那里看到它的文字.我怎么能绕过这个?
我试着用nltk然而,工作正常,clean_html并且clean_url将被删除向前发展.有没有办法使用汤get_text并获得相同的结果?
我试着看看这些其他页面:
BeautifulSoup get_text不会删除所有标记和JavaScript
目前我正在使用nltk已弃用的功能.
编辑
这是一个例子:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
Run Code Online (Sandbox Code Playgroud)
我仍然看到CNN的以下内容:
$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});
/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});
Run Code Online (Sandbox Code Playgroud)
我怎样才能删除js?
我找到的其他选项是:
https://github.com/aaronsw/html2text
问题html2text在于它有时真的很慢,并且会产生明显的滞后,这是nltk总是非常好的一件事.
Hug*_*ell 85
部分基于我可以使用BeautifulSoup删除脚本标签吗?
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.decompose() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Run Code Online (Sandbox Code Playgroud)
为了防止编码错误...
import urllib
from bs4 import BeautifulSoup
url = url
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text.encode('utf-8'))
Run Code Online (Sandbox Code Playgroud)