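NLTK in Python has a function, FreqDist, that gives you the frequency of the words in a text. I am trying to pass my text as an argument, but the result has the following form: [' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x'], whereas in the example on the NLTK website the result was whole words, not just letters. This is how I'm doing it: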
file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]
Do you know what I'm doing wrong? Thanks!
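A likely explanation: FreqDist counts whatever items it gets when it iterates over its argument, and iterating over a Python string yields single characters, so passing the raw file contents produces letter counts. Below is a minimal sketch of counting words instead. The `tokenize` and `word_frequencies` helpers are hypothetical names introduced here, and the regex split is only a simple stand-in for a real tokenizer such as `nltk.word_tokenize` (which needs downloaded tokenizer data). The fallback to `collections.Counter` is an assumption that exists only so the sketch runs without NLTK installed; in NLTK 3, FreqDist is itself a Counter subclass.

```python
import re
from collections import Counter

try:
    from nltk import FreqDist  # the real NLTK class
except ImportError:
    FreqDist = Counter  # assumption: fallback so the sketch runs without NLTK


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper, regex-based)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(text):
    """Build a frequency distribution over words rather than characters."""
    return FreqDist(tokenize(text))


sample = "the cat and the hat\nthe end"

# Passing the string directly counts characters, as in the question.
char_dist = FreqDist(sample)

# Passing a list of tokens counts words.
word_dist = word_frequencies(sample)
print(word_dist.most_common(3))
```

Applied to the code in the question, this would mean tokenizing `p` first and passing the resulting list of words to FreqDist, rather than `p` itself.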
I'm completely new to Python. I have the following code:
import sgmllib
import string

class ExtractTitle(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        # Collect text only while inside a title element.
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!
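It extracts the title element from SGML, but it only works for a single title. I know I have to override unknown_starttag and unknown_endtag to get all the titles, but I keep getting it wrong. Please help me!!!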
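Judging from the code above, parsing stops after the first title because end_title raises FoundTitle. One way to collect every title is to append each finished title to a list and let the parser continue. The sketch below illustrates that idea under a stated assumption: it uses `html.parser.HTMLParser` from the Python 3 standard library in place of sgmllib, which is not available in Python 3, so it is a swapped-in parser and not the original class. `ExtractTitles` and `extract_titles` are hypothetical names, and the sketch assumes the input is HTML-like markup.

```python
from html.parser import HTMLParser


class ExtractTitles(HTMLParser):
    """Collect the text of every <title> element instead of stopping at the first."""

    def __init__(self):
        super().__init__()
        self.titles = []     # finished titles, in document order
        self._buffer = None  # list of text chunks while inside <title>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while a title element is open.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None  # keep parsing; nothing is raised here


def extract_titles(markup):
    """Return the text of all title elements found in the markup."""
    parser = ExtractTitles()
    parser.feed(markup)
    parser.close()
    return parser.titles


print(extract_titles("<doc><title>First</title><p>x</p><title>Second one</title></doc>"))
```

The same pattern should carry over to the sgmllib class: keep start_title and handle_data as they are, and in end_title append the joined text to a list instead of raising.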