Is there a faster way to check words from a word list with Python's nltk?

Sky*_*erX 1 python nltk python-3.x

I am using the nltk module to check for valid English words in a word list of about 2.1 million keywords. The words are read from a text file, each one is checked to see whether it is a proper English word, and the valid words are written to another text file. The script works fine, but it is painfully slow, at roughly 7 iterations per second. Is there a faster way to do this?

Here is my code:

import nltk
from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503
with open('two_words.txt', 'r', encoding='utf-8') as file:
    for word in tqdm(file, total=total_size):
        word = word.strip()
        # check every word on the line against nltk's word list
        if all([w in words.words() for w in word.split()]):
            with open('good_two.txt', 'a', encoding='utf-8') as outfile:
                outfile.write(word)
                outfile.write('\n')

Is there a faster way to do the same thing, e.g. by using wordnet or any other suggestion?

Dar*_*ylG 5

You can make it much faster by converting words.words() to a set, as the following test shows.

from nltk.corpus import words
import time

# Test text
text = "she sell sea shell by the seashore"

# Original method: membership test against the list returned by words.words()
start = time.time()
x = all([w in words.words() for w in text.split()])
print("Duration Original Method: ", time.time() - start)

# Time to convert words.words() to a set (one-time cost)
start = time.time()
set_words = set(words.words())
print("Time to generate set: ", time.time() - start)

# Test using set (single iteration)
start = time.time()
x = all([w in set_words for w in text.split()])
print("Set using 1 iteration: ", time.time() - start)

# Test using set (100,000 iterations)
start = time.time()
for k in range(100000):
    x = all([w in set_words for w in text.split()])
print("Set using 100,000 iterations: ", time.time() - start)

The results show that using a set is roughly 200,000 times faster. This is because words.words() has 236,736 elements, so n ≈ 236,736, and by using a set we reduce each lookup from O(n) to O(1).

Duration Original Method:  0.601 seconds
Time to generate set:  0.131 seconds
Set using 1 iteration:  0.0 seconds
Set using 100,000 iterations:  0.304 seconds
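
Applied to your script, a minimal sketch could look like the following (assuming the same two_words.txt and good_two.txt files from the question): build the set once before the loop and open the output file once instead of on every iteration.

from nltk.corpus import words
from tqdm import tqdm

# Build the lookup set once, outside the loop
set_words = set(words.words())

total_size = 2170503
with open('two_words.txt', 'r', encoding='utf-8') as infile, \
     open('good_two.txt', 'a', encoding='utf-8') as outfile:
    for word in tqdm(infile, total=total_size):
        word = word.strip()
        # O(1) membership test per word instead of scanning the whole list
        if all(w in set_words for w in word.split()):
            outfile.write(word + '\n')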