Mar*_*orn 5 python nlp corpus nltk
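I have a set of documents, and I want to return a list of tuples where each tuple holds the date of a given document and the number of times a given search term appears in that document. My code (below) works, but it is slow, and I'm a n00b. Are there obvious ways to make it faster? Any help would be much appreciated, mostly so that I can learn to code better, but also so that I can finish this project faster!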
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    def searchText(searchword):
        counts = []
        corpus_root = 'some_dir'
        wordlists = PlaintextCorpusReader(corpus_root, '.*')
        for id in wordlists.fileids():
            # characters 4-11 of the file id hold the date as YYYYMMDD
            date = id[4:12]
            month = date[-4:-2]
            day = date[-2:]
            year = date[:4]
            raw = wordlists.raw(id)
            tokens = nltk.word_tokenize(raw)
            text = nltk.Text(tokens)
            count = text.count(searchword)
            counts.append((month, day, year, count))
        return counts
If all you want is word-count frequencies, you don't need to create an nltk.Text object, or even use nltk.PlaintextCorpusReader. Instead, go straight to nltk.FreqDist.
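    import nltk

    files = list_of_files  # paths of the documents to count
    fd = nltk.FreqDist()
    for file in files:
        with open(file) as f:
            # read the contents first; a file object has no lower() method
            for sent in nltk.sent_tokenize(f.read().lower()):
                for word in nltk.word_tokenize(sent):
                    fd[word] += 1  # FreqDist.inc() was removed in NLTK 3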
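To tie this back to the question: if you only need one term's count per document, you can skip the corpus-wide frequency table and count per file. Here is a rough sketch, assuming the file names carry the date as YYYYMMDD at characters 4-11, as the question's slicing suggests. `count_term` and `simple_tokens` are names I made up, and the regex tokenizer is only a stand-in so the example runs without NLTK data; swap in nltk.word_tokenize if you need its tokenization.

```python
import os
import re

def simple_tokens(text):
    # Hypothetical stand-in for nltk.word_tokenize: runs of word characters only.
    return re.findall(r"\w+", text)

def count_term(corpus_root, searchword):
    """Return a (month, day, year, count) tuple for every file in corpus_root."""
    counts = []
    for name in sorted(os.listdir(corpus_root)):
        path = os.path.join(corpus_root, name)
        if not os.path.isfile(path):
            continue
        date = name[4:12]  # assumes names shaped like 'doc_20120315.txt'
        with open(path) as f:
            tokens = simple_tokens(f.read())
        # list.count is a single pass over the tokens; no nltk.Text needed
        counts.append((date[4:6], date[6:8], date[:4], tokens.count(searchword)))
    return counts
```

Like the original function, this is case-sensitive; lowercase both the text and the search word if that is not what you want.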
Or, if you don't want to do any analysis, just use a dict.
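    import nltk

    files = list_of_files  # paths of the documents to count
    fd = {}
    for file in files:
        with open(file) as f:
            # read the contents first; a file object has no lower() method
            for sent in nltk.sent_tokenize(f.read().lower()):
                for word in nltk.word_tokenize(sent):
                    try:
                        fd[word] = fd[word] + 1
                    except KeyError:
                        # first time this word is seen
                        fd[word] = 1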
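The try/except bookkeeping can also be shortened with dict.get, or handed off to collections.Counter from the standard library. A small sketch on an in-memory token list (the `words` list is made up purely for illustration):

```python
from collections import Counter

words = ["the", "cat", "saw", "the", "dog"]  # made-up tokens for illustration

# dict.get supplies 0 for unseen words, so no KeyError handling is needed
fd = {}
for word in words:
    fd[word] = fd.get(word, 0) + 1

# Counter does the same counting in one call
fd_counter = Counter(words)
```

Both end up with the same word-to-count mapping; Counter additionally offers helpers such as most_common().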
These could be made more efficient with generator expressions, but I used for loops for readability.
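For a rough idea of what a generator-expression version might look like, the sketch below chains two generator expressions into collections.Counter. `read_lower` and `word_frequencies` are hypothetical names, and str.split stands in for the NLTK tokenizers so the example runs without NLTK data. Whether it is actually faster than the loops is worth measuring on your own corpus.

```python
from collections import Counter

def read_lower(path):
    # Read one file and lowercase its contents.
    with open(path) as f:
        return f.read().lower()

def word_frequencies(files):
    # Both generator expressions are lazy: only one file's text is held at a time.
    texts = (read_lower(path) for path in files)
    words = (word for text in texts for word in text.split())
    return Counter(words)
```

Usage would be along the lines of `word_frequencies(list_of_files)["python"]`, which gives 0 for a word that never occurs.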