I am using the nltk module to check a word list of about 2.1 million entries for valid English words. The words are read from a text file, each is checked for being a correct English word, and the correct ones are written to another text file. The script works, but it is painfully slow, at roughly 7 iterations per second. Is there a faster way to do this?

Here is my code:
import nltk
from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503

with open('two_words.txt', 'r', encoding='utf-8') as file:
    for word in tqdm(file, total=total_size):
        word = word.strip()
        if all([w in words.words() for w in word.split()]):
            with open('good_two.txt', 'a', encoding='utf-8') as file:
                file.write(word)
                file.write('\n')
        else:
            pass
Is there a faster way to do the same thing, e.g. by using wordnet or any other suggestion?
from nltk.corpus import words
import time

# Test text
text = "she sell sea shell by the seashore"

# Original method
start = time.time()
x = all([w in words.words() for w in text.split()])
print("Duration Original Method: ", time.time() - start)

# Time to convert words to a set
start = time.time()
set_words = set(words.words())
print("Time to generate set: ", time.time() - start)

# Test using set (single iteration)
start = time.time()
x = all([w in set_words for w in text.split()])
print("Set using 1 iteration: ", time.time() - start)

# Test using set (100,000 iterations)
start = time.time()
for k in range(100000):
    x = all([w in set_words for w in text.split()])
print("Set using 100,000 iterations: ", time.time() - start)
The results show that using a set is roughly 200,000x faster. This is because words.words() has 236,736 elements, so n ≈ 236,736; by converting the list to a set, each lookup drops from O(n) to O(1).
Duration Original Method: 0.601 seconds
Time to generate set: 0.131 seconds
Set using 1 iteration: 0.0 seconds
Set using 100,000 iterations: 0.304 seconds
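Applied to your script, this means building the set once before the loop. Here is a minimal sketch under that assumption (it keeps your file names and tqdm progress bar; it also opens the output file once instead of reopening it on every iteration, which is a smaller but related win):

from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503

# Build the lookup set once up front; membership tests are then O(1)
set_words = set(words.words())

with open('two_words.txt', 'r', encoding='utf-8') as infile, \
     open('good_two.txt', 'a', encoding='utf-8') as outfile:
    for word in tqdm(infile, total=total_size):
        word = word.strip()
        # Keep the line only if every whitespace-separated token is a known word
        if all(w in set_words for w in word.split()):
            outfile.write(word + '\n')

With the set lookup no longer the bottleneck, the run time should be dominated by plain file I/O over the 2.1 million lines rather than by the dictionary check.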