Tag: collocation

Forming bigrams of words in a list of sentences with Python

I have a list of sentences:

text = ['cant railway station', 'citadel hotel', ' police stn']


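I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get pairs of sentences instead of pairs of words. Here is what I did: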

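import nltk

# Each sentence becomes a list of words, so text2 is a list of lists.
text2 = [[word for word in line.split()] for line in text]
# nltk.bigrams pairs up adjacent items of text2, which are whole sentences here.
bigrams = list(nltk.bigrams(text2))
print(bigrams)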


This yields:

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]


Here 'cant railway station' and 'citadel hotel' form one bigram. What I want is:

[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...


The last word of the first sentence should not be merged with the first word of the second sentence. What should I do to make this work?
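One way to approach this is to build the bigrams one sentence at a time, so that no pair can span a sentence boundary. The sketch below uses plain `zip` rather than NLTK. The helper name `sentence_bigrams` is made up for illustration, and it returns plain tuples of words rather than the nested lists written in the question. Calling `nltk.bigrams(line.split())` inside the loop should give the same pairs.

```python
text = ['cant railway station', 'citadel hotel', ' police stn']

def sentence_bigrams(sentences):
    """Return word bigrams built within each sentence, never across sentences."""
    pairs = []
    for line in sentences:
        words = line.split()
        # Pair every word with its successor inside the same sentence only.
        pairs.extend(zip(words, words[1:]))
    return pairs

bigrams = sentence_bigrams(text)
print(bigrams)
```

A sentence with a single word contributes no pairs, since `zip` stops at the shorter of its two inputs.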

python list-comprehension list nltk collocation

Hyp*_*nja

2016-04-30

20
votes

4
answers

50k
views

NLTK collocations for specific words

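I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below.

However, I am not sure (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.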

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black  sheep shep bar bar black sentence"

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

# Print every trigram together with its PMI score.
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)
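A possible sketch for both points: `apply_ngram_filter` removes every candidate n-gram for which the given function returns true, so a filter that rejects n-grams lacking the target word leaves only that word's collocations. The association-measure classes also expose `likelihood_ratio`, which can be passed to `score_ngrams` in place of `pmi`. The target word `"foo"` is an arbitrary choice for illustration, bigrams are used to keep the output short, and the text is split on whitespace so that no tokenizer data is required.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()

target = "foo"  # hypothetical word of interest, chosen for illustration
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Remove every candidate bigram that does not contain the target word.
finder.apply_ngram_filter(lambda w1, w2: target not in (w1, w2))

# Score the remaining bigrams with the log-likelihood ratio.
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, score)
```

The same pattern should carry over to `TrigramCollocationFinder` with a three-argument filter function.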


python nltk collocation

Sab*_*bba

2014-01-17

12
votes

1
answer

8121
views

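How to get n-gram collocations and associations in Python NLTK?

In this document, there are examples using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

For bigrams and trigrams, there is an example that finds the n best collocations based on PMI. Example: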

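>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)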


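I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) of AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.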
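As a rough illustration of what an n-gram version could look like without subclassing anything, the sketch below scores n-grams of any length with a plain-Python PMI, log2(P(w1..wn) / (P(w1) * ... * P(wn))). It is not NLTK's implementation: the function names `ngram_pmi` and `nbest` and the `min_freq` parameter are made up for illustration, and estimating every probability with the token count as denominator is a simplifying assumption.

```python
import math
from collections import Counter

def ngram_pmi(tokens, n, min_freq=1):
    """Score every n-gram by PMI, highest score first."""
    total = len(tokens)
    unigram_counts = Counter(tokens)
    # Zip n shifted views of the token list to enumerate all n-grams.
    ngram_counts = Counter(zip(*(tokens[i:] for i in range(n))))
    scores = {}
    for ngram, count in ngram_counts.items():
        if count < min_freq:
            continue
        log_joint = math.log2(count / total)
        log_independent = sum(
            math.log2(unigram_counts[word] / total) for word in ngram
        )
        scores[ngram] = log_joint - log_independent
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def nbest(tokens, n, k, min_freq=1):
    """Return the k highest-scoring n-grams, loosely mirroring finder.nbest()."""
    return [ngram for ngram, _ in ngram_pmi(tokens, n, min_freq)[:k]]

tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()
top_4grams = nbest(tokens, 4, 3, min_freq=2)
print(top_4grams)
```

The `min_freq` argument plays the role that `apply_freq_filter` plays in the NLTK finders, since PMI tends to favour n-grams made of rare words.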

python nlp nltk n-gram collocation

Fah*_*zal

2015-12-12

6
votes

2
answers

6775
views

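NLTK: Find contexts of size 2k for a word

I have a corpus and I have a word. For each occurrence of the word in the corpus, I want to get a list containing the k words before and the k words after the word. I am managing this algorithmically (see below), but I wonder whether NLTK provides some functionality for this that I have missed?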

def sized_context(word_index, window_radius, corpus):
    """Return up to window_radius words on each side of word_index.

    The word at word_index itself is not included.
    """
    max_length = len(corpus)

    # Clamp the window to the corpus boundaries.
    left_border = max(0, word_index - window_radius)
    right_border = min(max_length, word_index + 1 + window_radius)

    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]
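To cover every occurrence of the word rather than a single index, a small generator over the corpus may be enough. This is a plain-Python sketch, not an NLTK feature; the name `contexts` and the toy corpus are made up for illustration. It relies on list slicing clamping an end index that runs past the end of the list, so only the left border needs an explicit guard.

```python
def contexts(word, window_radius, corpus):
    """Yield the neighbouring tokens (up to 2k) for each occurrence of word."""
    for index, token in enumerate(corpus):
        if token == word:
            # max() stops a negative start index from wrapping around.
            left = corpus[max(0, index - window_radius):index]
            right = corpus[index + 1:index + 1 + window_radius]
            yield left + right

corpus = "a b c d b e".split()  # toy corpus for illustration
windows = list(contexts("b", 2, corpus))
print(windows)
```

Occurrences near the start or end of the corpus simply get a shorter window.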


python nlp nltk collocation

Zak*_*kum

2015-06-08

3
votes

1
answer

1179
views

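How to get PMI scores for trigrams with NLTK collocations? Python

I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below.

My only problem is how to print out the bigrams together with their PMI values. I have searched the NLTK documentation multiple times. Either I am missing something or it is not there.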

import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Read the corpus and split it on whitespace.
with open("large.txt", "r") as f:
    myList = f.read().split()
myCorpus = nltk.Text(myList)
trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)

# Ignore trigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
print(finder.nbest(trigram_measures.pmi, 500000))
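One likely route is `score_ngrams`, which returns (ngram, score) pairs sorted from highest to lowest score, whereas `nbest` returns only the n-grams. The sketch below uses a short inline token list as a stand-in for `large.txt`, which is an assumption made so that the example is self-contained.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Stand-in corpus for illustration; replace with the tokens read from a file.
tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

# Ignore trigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)

# score_ngrams keeps the PMI value next to each trigram.
scored = finder.score_ngrams(trigram_measures.pmi)
for trigram, score in scored:
    print(trigram, score)
```

Swapping `TrigramCollocationFinder` and `TrigramAssocMeasures` for their bigram counterparts should give the bigram scores in the same way.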


python nlp nltk collocation

Sab*_*bba

2014-01-16

2
votes

1
answer

7611
views