Related questions and solutions (0)

What are ngram counts and how can they be implemented with nltk?

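I read a paper that uses ngram counts as features for a classifier, and I would like to know exactly what that means.

Example text: "Lorem ipsum dolor sit amet,consetetur sadipscing elitr,sed diam"

I can create unigrams, bigrams, trigrams, etc. from this text, and I have to define the "level" at which they are created. The "level" can be character, syllable, word, ...

So creating unigrams from the sentence above would simply produce a list of all the words?

Would creating bigrams produce word pairs, each combining two words that follow one another?

So when the paper talks about ngram counts, it simply creates unigrams, bigrams, trigrams, etc. from the text and counts how often each ngram occurs?

Is there an existing method for this in Python's nltk package, or do I have to implement my own version?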
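As a minimal sketch of what "ngram counts" could look like in practice: `nltk.util.ngrams` yields the n-grams of any sequence as tuples, and `collections.Counter` tallies them. The regex tokenizer below is an assumption made for illustration (any tokenizer would do), and the variable names are hypothetical.

```python
import re
from collections import Counter

from nltk.util import ngrams

text = "Lorem ipsum dolor sit amet,consetetur sadipscing elitr,sed diam"

# Word level: split on non-word characters (an assumed, simplistic tokenizer).
tokens = re.findall(r"\w+", text.lower())

# Count how often each n-gram occurs; n-grams are tuples of n consecutive tokens.
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Character level: a string is also a sequence, so the same call works on it.
char_bigram_counts = Counter(ngrams(text.lower(), 2))

print(bigram_counts.most_common(3))
print(char_bigram_counts.most_common(3))
```

The resulting counters map each n-gram to its frequency, and those frequencies are what a classifier would typically consume as features.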

python nlp nltk

ako*_*out

lucky-day

14
Votes

1
Answers

20k
Views

NLTK language modeling confusion

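I want to train a language model with NLTK in Python, but I have run into several problems. First of all, I don't know why my words turn into characters when I write something like this: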

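from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."
s = s.lower()

# Pad the tokenized sentence with <s> and </s> markers
paddedLine = pad_both_ends(word_tokenize(s), n=2)

# Build training n-grams and a vocabulary, then fit a bigram MLE model
train, vocab = padded_everygram_pipeline(2, paddedLine)
print(list(vocab))
lm = MLE(2)
lm.fit(train, vocab)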


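The vocabulary that gets printed looks like this, which is clearly not correct (I don't want to work with characters!). This is part of the output: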

<s>', '<', 's', '>', '</s>', '<s>', 'n', 'a', 't', 'u', 'r', 'a', 'l', '-', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '</s>', '<s>', 'p', 'r', 'o', 'c', 'e', 's', …
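One likely explanation, sketched under stated assumptions: `padded_everygram_pipeline` expects an iterable of tokenized sentences (for example a list of token lists) and pads each sentence itself. When it receives a single padded sentence, every word is treated as a "sentence" and iterated character by character. The sketch below passes a list containing one token list and drops the extra `pad_both_ends` call. It uses `str.split()` in place of `word_tokenize` purely to avoid needing tokenizer data, which is an assumption of this example.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

s = ("Natural-language processing (NLP) is an area of computer science "
     "and artificial intelligence concerned with the interactions "
     "between computers and human (natural) languages.")

# One tokenized sentence; whitespace splitting is an assumed stand-in
# for word_tokenize.
tokens = s.lower().split()

# Wrap the sentence in a list: the pipeline wants a corpus (list of sentences)
# and applies the <s> / </s> padding on its own.
train, vocab = padded_everygram_pipeline(2, [tokens])

lm = MLE(2)
lm.fit(train, vocab)

# Inspect the fitted vocabulary instead of consuming the lazy `vocab` iterator,
# which would leave nothing for fit() to read.
print(sorted(lm.vocab))
```

With this input the vocabulary contains whole words plus the padding symbols rather than single characters.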


python nlp machine-learning nltk

Pey*_*ghi

2020 01-05

1
Votes

1
Answers

1872
Views