Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

Ero*_*mic 10 python nlp

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

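I would like to parse this into words. I had a quick look at enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I would consider writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought something to do this already exists online. Am I wrong?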

hug*_*own 9

Greedy approach using a trie

Try this using Biopython (pip install biopython):

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


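def get_trie_word(tr, s):
    # Greedy: try the longest prefix of s first and shrink until one is a word.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    # No prefix of s is in the dictionary.
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # Stop here; otherwise s never shrinks and the loop never ends.
            print("unmatched: " + s)
            break
        print(word)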

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

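There are degenerate cases in English that this will not work for: a greedy longest match can consume a word whose remainder cannot be split any further. You need backtracking to handle those, but this should get you started.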
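As a rough illustration of the backtracking mentioned above, here is a minimal sketch in plain Python 3. It is not part of the original answer: it swaps the Biopython trie for an ordinary set of words, and the `segment` function and the toy `vocab` are hypothetical names made up for this example. It still tries the longest candidate first, but falls back to shorter ones when the rest of the string cannot be split.

```python
from functools import lru_cache


def segment(text, words):
    """Return one split of text into words from `words`, or None if none exists.

    Longest candidates are tried first (as in the greedy version), but the
    search backtracks when the remainder cannot be segmented.
    """
    max_len = max(map(len, words)) if words else 0

    @lru_cache(maxsize=None)  # memoize on the start index to avoid rework
    def solve(start):
        if start == len(text):
            return ()
        # Try the longest possible word starting at `start`, then shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None  # dead end: the caller will try a shorter word

    result = solve(0)
    return None if result is None else list(result)


# Toy vocabulary, made up for illustration; a real run would load a word list.
vocab = frozenset(["the", "them", "theme", "me", "men", "end"])
print(segment("themend", vocab))  # ['them', 'end']
```

A purely greedy pass would take "theme" first and then get stuck on "nd"; the sketch backs off to "them" and finishes with "end". It returns only the first segmentation it finds, so ranking several candidate splits (for example by word frequency) is left out.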

Obligatory test

>>> main("expertsexchange")
experts
exchange