Related questions (0)

Probability tree for sentences in nltk using both lookahead and lookback dependencies

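Does nltk, or any other NLP tool, allow constructing probability trees from input sentences, so that the language model of the input text is stored in a dictionary tree (trie)? The following example gives a rough idea, but I need the same functionality such that a word Wt is modelled probabilistically not only on the past input words (the history) Wt-n, but also on lookahead words such as Wt+m. In addition, the number of lookback and lookahead words should be 2 or more, i.e. bigrams or longer. Are there any other Python libraries that achieve this?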

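from collections import defaultdict
import math
import nltk

# ngram[token][next_token] holds bigram counts, later replaced by log10 P(next_token | token)
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # Build a list (not a lazy map object) so that tokens[1:] works in Python 3
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
for token in ngram:
    # Convert the raw counts for each history word into log10 relative frequencies
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}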


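The solution needs both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources that discuss implementing such a system are also welcome. nltk.models seemed to do something similar but is no longer available. Are there any existing design patterns in NLP for implementing this idea? Skip-gram based models are similar to this idea too, but I feel this should already have been implemented somewhere.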
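One way to picture the model being asked for is a nested dictionary keyed by a pair of contexts: the `n_back` words before a position and the `n_ahead` words after it. The sketch below is only an illustration under stated assumptions: `train_context_model`, `log_prob`, and the `<s>` / `</s>` padding symbols are hypothetical names, not part of nltk; sentences are assumed to be pre-tokenized lists; and the estimates are unsmoothed relative frequencies.

```python
from collections import Counter, defaultdict
import math


def train_context_model(sentences, n_back=2, n_ahead=2):
    """Count how often each word occurs between a given left and right context.

    Hypothetical helper: `sentences` is an iterable of token lists.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        # Pad both ends so that words near the sentence edges get full contexts.
        padded = ["<s>"] * n_back + [t.lower() for t in tokens] + ["</s>"] * n_ahead
        for i in range(n_back, len(padded) - n_ahead):
            left = tuple(padded[i - n_back:i])             # lookback words
            right = tuple(padded[i + 1:i + 1 + n_ahead])   # lookahead words
            counts[(left, right)][padded[i]] += 1
    return counts


def log_prob(counts, left, right, word):
    """Return log10 P(word | left, right), or None for unseen contexts or words."""
    context = counts.get((tuple(left), tuple(right)))
    if not context or word not in context:
        return None
    return math.log10(context[word]) - math.log10(sum(context.values()))


sentences = [
    ["the", "cat", "is", "cute"],
    ["the", "dog", "is", "cute"],
    ["the", "cat", "is", "happy"],
]
model = train_context_model(sentences, n_back=1, n_ahead=1)
print(log_prob(model, ["the"], ["is"], "cat"))
```

Keying on the full `(left, right)` tuple keeps lookups simple, but the table becomes sparse quickly as the context sizes grow, so a real system would probably need smoothing or backoff to shorter contexts.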

python dictionary nlp linguistics nltk

sta*_*kit

2017 05-23

12
votes

1
answer

866
views

Is there any way to implement skip-grams in scikit-learn?

Is there any way to implement skip-grams in the scikit-learn library? I manually generated a list of k-skip-n-grams and passed it as the vocabulary to scikit-learn's CountVectorizer().

Unfortunately, its prediction performance is very poor: only 63% accuracy. When I use CountVectorizer() with its default code instead, I get 77-80% accuracy.

Is there a better way to implement skip-grams in scikit-learn?

Here is the relevant part of my code:

from sklearn.feature_extraction.text import CountVectorizer

corpus = GetCorpus()  # reads the text from a file and returns it as a list

# GetVocabulary returns the k-skip-n-grams built from the corpus
vocabulary = list(GetVocabulary(corpus, k, n))

vec = CountVectorizer(
    tokenizer=lambda x: x.split(),
    ngram_range=(2, 2),
    stop_words=stopWords,
    vocabulary=vocabulary)
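As a rough sketch of an alternative, the skip-grams can be generated at analysis time rather than supplied as a fixed vocabulary. The helpers below (`k_skip_n_grams` and `skipgram_analyzer`, both hypothetical names) use only the standard library and assume whitespace tokenization and lowercasing. A k-skip-n-gram is taken here to mean n tokens in their original order with at most k tokens skipped in total.

```python
from itertools import combinations


def k_skip_n_grams(tokens, n, k):
    """Yield every k-skip-n-gram: n tokens in order, skipping at most k in total."""
    for i in range(len(tokens) - n + 1):
        head = tokens[i]
        # The remaining n-1 tokens are picked from the next n-1+k positions,
        # which bounds the total number of skipped tokens by k.
        window = tokens[i + 1:i + n + k]
        for tail in combinations(window, n - 1):
            yield (head, *tail)


def skipgram_analyzer(text, n=2, k=1):
    """Turn raw text into space-joined skip-gram features (hypothetical helper)."""
    tokens = text.lower().split()
    return [" ".join(gram) for gram in k_skip_n_grams(tokens, n, k)]


print(list(k_skip_n_grams(["a", "b", "c", "d"], n=2, k=1)))
print(skipgram_analyzer("The cat sat"))
```

A callable like `skipgram_analyzer` can be passed to `CountVectorizer` through its `analyzer` parameter; in that case `tokenizer`, `ngram_range`, and `stop_words` no longer apply, so any stop-word filtering would have to happen inside the callable. One possible reason for the low accuracy in the code above is that `ngram_range=(2, 2)` only produces contiguous bigrams, so vocabulary entries that contain a gap would never be matched.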


python machine-learning scikit-learn

Md.*_*man

2019 04-21

8
votes

2
answers

2231
views