Sab*_*bba 2 python nlp nltk collocation
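I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpus. The code is below.
My only problem is how to print out the bigrams together with their PMI values. I have searched the NLTK documentation several times. Either I'm missing something or it isn't there.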
import nltk
from nltk.collocations import *

# Read the corpus and split it into whitespace-delimited tokens
with open("large.txt", 'r') as f:
    myFile = f.read()
myList = myFile.split()
myCorpus = nltk.Text(myList)

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)
finder.apply_freq_filter(3)  # drop trigrams that occur fewer than 3 times
print(finder.nbest(trigram_measures.pmi, 500000))
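If you look at the source code of nltk.collocations.TrigramCollocationFinder (see http://www.nltk.org/_modules/nltk/collocations.html), you'll find that nbest() simply returns the ngrams from TrigramCollocationFinder().score_ngrams and discards the scores: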
def nbest(self, score_fn, n):
    """Returns the top n ngrams when scored by the given function."""
    return [p for p,s in self.score_ngrams(score_fn)[:n]]
So you can call score_ngrams() directly instead of nbest(), since it returns a sorted list of (ngram, score) pairs anyway:
def score_ngrams(self, score_fn):
    """Returns a sequence of (ngram, score) pairs ordered from highest to
    lowest score, as determined by the scoring function provided.
    """
    return sorted(self._score_ngrams(score_fn),
                  key=_itemgetter(1), reverse=True)
Try:
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))
# score_ngrams() returns (ngram, score) pairs, highest score first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)
[out]:
(('this', 'is', 'a'), 9.047123912114026)
(('is', 'a', 'foo'), 7.46216141139287)
(('black', 'sheep', 'shep'), 5.46216141139287)
(('black', 'sheep', 'foo'), 4.877198910671714)
(('a', 'foo', 'bar'), 4.462161411392869)
(('sheep', 'shep', 'bar'), 4.462161411392869)
(('bar', 'black', 'sheep'), 4.047123912114026)
(('bar', 'black', 'sentence'), 4.047123912114026)
(('sheep', 'foo', 'bar'), 3.877198910671714)
(('bar', 'bar', 'black'), 3.047123912114026)
(('foo', 'bar', 'bar'), 3.047123912114026)
(('shep', 'bar', 'bar'), 3.047123912114026)
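As a sanity check, the scores above can be reproduced by hand with only the standard library. This is a minimal sketch, assuming PMI for a trigram is log2(count(w1, w2, w3) * N^2) - log2(count(w1) * count(w2) * count(w3)), where N is the number of tokens; that formula matches the numbers printed above. The helper name trigram_pmi is made up for illustration and is not part of NLTK.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # no punctuation here, so a whitespace split is enough

word_counts = Counter(tokens)
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
n_words = len(tokens)


def trigram_pmi(w1, w2, w3):
    """Hypothetical helper: PMI of one trigram, computed from raw counts."""
    joint = trigram_counts[(w1, w2, w3)]
    independent = word_counts[w1] * word_counts[w2] * word_counts[w3]
    return math.log2(joint * n_words ** 2) - math.log2(independent)


print(trigram_pmi("this", "is", "a"))
print(trigram_pmi("bar", "bar", "black"))
```

Rare trigrams made of rare words get the highest PMI, which is why ('this', 'is', 'a') ranks first even though it occurs only once. That is also the reason for calling apply_freq_filter() on a real corpus.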
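Since the question asks about bigrams, the same approach should carry over by swapping in BigramCollocationFinder and BigramAssocMeasures. A minimal sketch, assuming the same toy sentence and a plain whitespace split instead of word_tokenize (so no tokenizer data has to be downloaded):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # assumption: whitespace tokens; use your own tokenizer

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# On a real corpus you would probably filter rare bigrams first, e.g.:
# finder.apply_freq_filter(3)

# score_ngrams() returns (bigram, score) pairs sorted by descending score
scored = finder.score_ngrams(bigram_measures.pmi)
for bigram, score in scored:
    print(bigram, score)
```

If you only need the top few, slice the list (scored[:10]) rather than calling nbest(), which drops the scores.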