KeyError: '\\documentclass'

Sim*_*ity 1 python nltk

I have the following Python script:

import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

frequencies = {}
book = open('book.txt')
read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] = frequencies[w] + 1 

print (frequencies)

When I try to run the script, I get the following:

[nltk_data] Downloading package punkt to /home/abc/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    frequencies[w] = frequencies[w] + 1 
KeyError: '\\documentclass'

What am I doing wrong? Also, how can I print each word in the text file along with its number of occurrences?

You can download book.txt here.

Jea*_*bre 6

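Your frequencies dictionary is empty. The expression frequencies[w] + 1 reads the key before it has ever been set, so you get a KeyError on the very first word, which is expected.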
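
To make the failure concrete, here is a minimal sketch that reproduces the error without NLTK and shows one way to keep a plain dict. The token list is made up for illustration and stands in for the output of nltk.word_tokenize; dict.get is just one possible fix.

```python
# Hypothetical token list standing in for nltk.word_tokenize(read_book).
words = ["\\documentclass", "the", "cat", "the"]

frequencies = {}
error_key = None
try:
    # Reading frequencies[w] before the key exists raises KeyError.
    frequencies[words[0]] = frequencies[words[0]] + 1
except KeyError as exc:
    error_key = exc.args[0]

# dict.get supplies a default of 0 for words not seen yet.
for w in words:
    frequencies[w] = frequencies.get(w, 0) + 1

print(error_key)
print(frequencies)
```

The first print shows the key that triggered the error; the second shows the counts once a default is supplied.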

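I suggest using collections.Counter instead. It is a specialized dictionary (a bit like defaultdict) designed for counting occurrences: a missing key counts as 0.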

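import collections

import nltk
from nltk.probability import FreqDist

nltk.download('punkt')  # tokenizer data needed by word_tokenize

frequencies = collections.Counter()
with open('book.txt') as book:
    read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)  # built here but not used in this version

# Counter treats missing keys as 0, so += 1 works on a word's first occurrence
for w in words:
    frequencies[w] += 1

print(frequencies)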

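Edit: the code above answers your question without really using the nltk package; I answered as if nltk were just a string tokenizer. To be more specific, and to allow further text analysis without reinventing the wheel (thanks to the various comments below), you should simply do this: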

import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

with open('book.txt') as book:
    read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)   # FreqDist does the counting, so no loop is needed

print(frequencyDist)

You will get (with my text):

<FreqDist with 142 samples and 476 outcomes>

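So you do not directly get a dictionary of word => number of occurrences, but a richer object that carries this information and more:

  • frequencyDist.items(): gives you the word => count pairs (along with all the classic dict methods)
  • frequencyDist.most_common(50): returns the 50 most common words with their counts
  • frequencyDist['the']: returns the number of occurrences of "the"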
  • ...
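
As a rough sketch of the accessors listed above, and of printing each word with its count as the question asks, the example below uses collections.Counter from the standard library so that it runs without any NLTK data. As far as I know, FreqDist in NLTK 3 is a subclass of Counter, so the same calls should work on frequencyDist. The token list and the format_counts helper are made up for illustration.

```python
from collections import Counter

# Hypothetical tokens; in the script above they come from nltk.word_tokenize.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
counts = Counter(tokens)


def format_counts(counter):
    """Return one 'word<TAB>count' line per word, most frequent first."""
    return [f"{word}\t{count}" for word, count in counter.most_common()]


lines = format_counts(counts)
print("\n".join(lines))       # every word with its count
print(counts.most_common(2))  # the two most frequent (word, count) pairs
print(counts["the"])          # count for a single word
```

Looking up a word that never occurred, such as counts["dog"], returns 0 instead of raising KeyError, which is exactly what the original loop was missing.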

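  • Rather than using `collections.Counter`, I would suggest using the `FreqDist` that is built and then completely ignored. It is basically a `Counter` with some extra NLTK utilities attached. (2 upvotes)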