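Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another. My plan is to use the collocation facilities in NLTK to score the word pairs, with the highest-scoring pair being the most likely.
Approach:
I wrote the following code in Python using NLTK (several steps and imports removed for brevity):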
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams( bgm.likelihood_ratio )
print(scored)
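Results:
I then checked the results with two word pairs, one that should be very likely to co-occur and one that should not ("roasted cashews" and "gasoline cashews"). I was surprised to see both pairs receive exactly the same score: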
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
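In my test, I would have expected "roasted cashews" to score higher than "gasoline cashews".
Question:
Any information or help is greatly appreciated!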
Rob*_*aus 31
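The NLTK collocations documentation looks pretty good to me: http://www.nltk.org/howto/collocations.html
You need to give the scorer a corpus of realistic size to work with. Here is a working example using the Brown corpus that ships with NLTK. It takes about 30 seconds to run.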
import collections
import nltk.collocations
import nltk.corpus
# Requires the Brown corpus data, e.g. via nltk.download('brown').
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)
# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])
print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])
The output seems reasonable: it works well for baseball, less well for doctor and happy.
doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
('annoys', 19.009636692022365),
('had', 16.730384189212423), ('retorted', 15.190847940499127)]
baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
('park', 23.509042621473505), ('games', 23.105033513054011),
("player's", 16.227872863424668)]
happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
('family', 13.734352182441569),
(',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
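To see why corpus size matters, here is a minimal pure-Python sketch of a log-likelihood ratio score for a bigram. It is a plain reimplementation for illustration, not NLTK's own code, so absolute values may differ from what `bgm.likelihood_ratio` reports. The helper names `likelihood_ratio` and `score_pair` and the toy sentence are made up for this example, and it assumes (the question does not say so) that each pair was scored against a token list containing little more than the pair itself.

```python
import math
from collections import Counter


def likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """Log-likelihood ratio for a bigram (w1, w2), computed from counts only.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of w1
    n_xi: count of w2
    n_xx: total number of tokens
    """
    # Observed 2x2 contingency table, flattened.
    observed = [
        n_ii,                        # w1 followed by w2
        n_ix - n_ii,                 # w1 followed by another word
        n_xi - n_ii,                 # another word followed by w2
        n_xx - n_ix - n_xi + n_ii,   # neither w1 first nor w2 second
    ]
    # Row and column totals matching each cell above.
    rows = [n_ix, n_ix, n_xx - n_ix, n_xx - n_ix]
    cols = [n_xi, n_xx - n_xi, n_xi, n_xx - n_xi]
    expected = [r * c / n_xx for r, c in zip(rows, cols)]
    # Cells with an observed count of zero contribute nothing.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def score_pair(tokens, w1, w2):
    """Score (w1, w2) against the given token list (hypothetical helper)."""
    words = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return likelihood_ratio(bigrams[(w1, w2)], words[w1], words[w2], len(tokens))


# A "corpus" holding only the pair gives every pair the same counts,
# and therefore the same score.
tiny_a = score_pair(['roasted', 'cashews'], 'roasted', 'cashews')
tiny_b = score_pair(['gasoline', 'cashews'], 'gasoline', 'cashews')
print(tiny_a == tiny_b)

# A made-up toy sentence already separates the two pairs.
toy = ("we ate roasted cashews and roasted cashews while the car "
       "burned gasoline and we bought cashews").split()
roasted = score_pair(toy, 'roasted', 'cashews')
gasoline = score_pair(toy, 'gasoline', 'cashews')
print(roasted > gasoline)
```

The score depends only on the four counts, so two pairs with identical counts cannot be told apart. With a real corpus such as Brown, the counts for a plausible pair and an implausible one diverge, and so do the scores.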