NLTK words corpus does not contain "okay"?

Mon*_*lto 9 python dictionary corpus nltk

Why doesn't the NLTK words corpus contain the words "okay", "ok", or "Okay"?

>>> from nltk.corpus import words
>>> words.words().__contains__("check")
True

>>> words.words().__contains__("okay")
False

>>> len(words.words())
236736

Any ideas?

alv*_*vas 10

TL;DR

from nltk.corpus import words
from nltk.corpus import wordnet

# wordnet.words() returns an iterator, so convert it to a list before concatenating
manywords = words.words() + list(wordnet.words())

In long

According to the documentation, nltk.corpus.words is the word list described at http://en.wikipedia.org/wiki/Words_(Unix)

On Unix, you can list the dictionary directory:

ls /usr/share/dict/

And read the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
#   @(#)README  8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier.  The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases.  The wordlist makes a dandy 'grep' victim.

     -- James A. Woods    {ihnp4,hplabs}!ames!jaw    (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.


FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen               typographical error in original file
    freend              archaic spelling no longer in use;
                    masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'.  This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation.  It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

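Since it is a fixed list of 234,936 words, there are bound to be words that are not in it. Per the README, the list comes from Webster's Second International (1934 copyright), which may explain why an informal word such as "okay" is missing.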

If you need to extend your word list, you can add the words from WordNet, available through nltk.corpus.wordnet.words().
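A minimal sketch of that extension, assuming the `words` and `wordnet` corpora have already been fetched with nltk.download(). The helper names merge_vocabularies and load_nltk_vocabulary are hypothetical and not part of NLTK, and lowercasing every entry is a choice made here for case-insensitive lookups, not something the original answer prescribes.

```python
def merge_vocabularies(*sources):
    """Merge any number of iterables of strings into one lowercase set."""
    vocabulary = set()
    for source in sources:
        # Lists and iterators are both accepted, so the iterator returned by
        # wordnet.words() can be passed in without wrapping it in list().
        vocabulary.update(word.lower() for word in source)
    return vocabulary


def load_nltk_vocabulary():
    """Hypothetical helper: combine the NLTK words corpus with WordNet lemma names."""
    # Imported lazily so merge_vocabularies stays usable without NLTK installed.
    from nltk.corpus import words, wordnet
    return merge_vocabularies(words.words(), wordnet.words())
```

With the corpora installed, `"okay" in load_nltk_vocabulary()` tests membership against the combined set. A set is used instead of a concatenated list because each lookup is then a hash lookup and not a scan over several hundred thousand entries.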

Most probably, all you need is a large enough text corpus, e.g. a Wikipedia dump, which you then tokenize to extract all the unique words.
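A rough sketch of that approach using only the standard library. The regular expression is a simplifying assumption (a word is a run of ASCII letters with optional internal apostrophes), and the function names and the min_count parameter are illustrative; for real text, a proper tokenizer such as the ones NLTK ships would likely do better.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally with internal apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words(lines):
    """Count lowercase word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(token.lower() for token in WORD_PATTERN.findall(line))
    return counts


def unique_words(lines, min_count=1):
    """Return the set of words that occur at least min_count times."""
    return {word for word, n in count_words(lines).items() if n >= min_count}
```

Because the functions take any iterable of lines, an open file object for a plain-text dump can be passed directly and is read one line at a time. Raising min_count is one simple way to drop rare tokens such as typos.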

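  • Got this: TypeError: can only concatenate list (not "dict_keyiterator") to list (4 upvotes)
  • "There is a corpus that contains the word you are looking for" doesn't really answer "why doesn't this corpus contain this word". It is also unclear how the "TL;DR" applies or what it explains. (2 upvotes)
  • from nltk.corpus import wordnet as wn? (2 upvotes)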