I am building a language model with Python and NLTK as follows:
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)
# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006
# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])
But it does not seem to work. The result is as follows:
>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
    "context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
Can anyone help me? Thanks!
Pet*_*nns 12
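I know this question is old, but it pops up every time I google NLTK's NgramModel class. NgramModel's prob implementation is a little unintuitive. The asker is confused, and as far as I can tell the existing answers aren't great. Since I don't use NgramModel often, this means I get confused too. No more.
The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method: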
def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])
(Note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)'.)
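To make the first two lines of prob concrete, here is a minimal pure-Python sketch of how the lookup key gets built. It does not use NLTK; `seen_ngrams` and `ngram_key` are hypothetical stand-ins for `self._ngrams` and the `tuple(context) + (word,)` expression above.

```python
# Toy stand-in for NgramModel._ngrams: the set of trigrams seen in training.
# (Hypothetical data, not taken from the Brown corpus.)
seen_ngrams = {("rain", "in", "spain")}


def ngram_key(word, context):
    """Mimic the first step of prob(): tuple(context) + (word,)."""
    return tuple(context) + (word,)


# A plain string is iterated character by character.
key_from_string = ngram_key("spain", "rain in")
# A one-element list keeps the whole phrase as a single "token".
key_from_one_item_list = ngram_key("spain", ["rain in"])
# A list of separate tokens yields the kind of key the model stored.
key_from_token_list = ngram_key("spain", ["rain", "in"])

print(key_from_string)
print(key_from_one_item_list)
print(key_from_token_list)
print(key_from_token_list in seen_ngrams)
```

Only the last key can match a stored trigram. The first two never match, so the model keeps backing off to shorter and shorter contexts.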
Okay. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:
for sent in train:
    for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)
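Alright. The constructor builds a conditional probability distribution (self._model) from a conditional frequency distribution whose contexts are tuples of unigrams. This tells us that 'context' should not be a string or a list containing a single multi-word string. 'context' must be an iterable of unigrams. In fact, the requirement is a little stricter: these tuples or lists must have length n-1. Think of it this way: you told it to be a trigram model, so you had better give it a context appropriate for trigrams, that is, two words.
Let's see this in action with a simpler example: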
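The counting loop in the constructor can be approximated in plain Python to see what the context keys look like. This is only a sketch: `build_context_counts` and `check_context` are hypothetical helpers, padding (`self._lpad` / `self._rpad`) is left out, and a `Counter` stands in for NLTK's frequency distribution.

```python
from collections import Counter, defaultdict


def build_context_counts(tokens, n):
    """Count which token follows each (n-1)-token context (no padding)."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        context, token = ngram[:-1], ngram[-1]
        counts[context][token] += 1
    return counts


def check_context(context, n):
    """Reject contexts that cannot be a key of an order-n model."""
    if isinstance(context, str):
        raise TypeError("context must be a sequence of tokens, not a string")
    context = tuple(context)
    if len(context) != n - 1:
        raise ValueError("expected %d context tokens, got %d"
                         % (n - 1, len(context)))
    return context


tokens = "the rain in spain falls mainly in the plains".split()
counts = build_context_counts(tokens, 3)

# Every key is a tuple of exactly n-1 = 2 tokens.
print(sorted(counts)[:3])
print(counts[check_context(["rain", "in"], 3)])
```

A guard like `check_context` is not part of NgramModel. It only illustrates the check you may want to do yourself before calling prob.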
>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0
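(As a side note, actually trying to do anything with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
As for the original question, I suppose my best guess at what the OP wants is this: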
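One plausible reason MLE causes trouble (an assumption on my part, not something the NLTK source states here) is that it assigns zero probability to every unseen continuation, which leaves nothing for a backoff scheme to redistribute. The sketch below compares plain MLE with Lidstone smoothing, (count + gamma) / (N + bins * gamma), on the toy sentence. `mle_prob` and `lidstone_prob` are hypothetical helpers written for illustration, not NLTK calls.

```python
from collections import Counter


def mle_prob(counts, word):
    """Maximum likelihood estimate: relative frequency, zero if unseen."""
    total = sum(counts.values())
    return counts[word] / float(total) if total else 0.0


def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: add gamma to every one of `bins` possible outcomes."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + bins * gamma)


tokens = "the rain in spain falls mainly in the plains".split()
vocabulary = sorted(set(tokens))

# Words observed directly after "the" in the toy sentence.
followers_of_the = Counter(
    nxt for prev, nxt in zip(tokens, tokens[1:]) if prev == "the"
)

for word in ("rain", "spain"):
    print(word)
    print(mle_prob(followers_of_the, word))
    print(lidstone_prob(followers_of_the, word, 0.2, len(vocabulary)))
```

With gamma set to 0.2, as in the question's estimator, an unseen word such as 'spain' after 'the' gets a small non-zero probability instead of 0.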
print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())
...but there are so many misunderstandings going on here that I can't tell what he was actually trying to do.
Quick fix:
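If the goal is to score a word given a longer piece of running text, only the last n-1 tokens matter to the model. A small helper along these lines (hypothetical, using naive whitespace tokenization) could trim the history before it is passed as the context:

```python
def context_for(history, n):
    """Return the last n-1 whitespace-separated tokens of `history`."""
    if n < 2:
        # A unigram model has no context at all.
        return []
    return history.split()[-(n - 1):]


history = "This is a context which generates a"
trigram_context = context_for(history, 3)
print(trigram_context)
```

For the trigram model in the question, this gives the same two-token list as `"generates a".split()`.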
print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006
Regarding your second question: this happens because "b" does not occur in the news category of the Brown corpus, as you can verify with:
>>> 'b' in brown.words(categories='news')
False
whereas
>>> 'word' in brown.words(categories='news')
True
I admit the error message is very cryptic, so you might want to file a bug report with the NLTK authors.