Computing N Grams using Python

gra*_*aci 23 python nlp nltk n-gram

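I need to compute the unigrams, bigrams and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone. Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code: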

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                          # start with an empty list
    if N > 1:
        space = " " * (N - 1)           # add N - 1 spaces
        text = space + text + space     # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                        # return the list

# test code
for i in range(5):
    print(N_Gram(i + 1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

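But it produces all the n-grams within a word, whereas I want n-grams between words, as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me out with how to get this done?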

Fra*_*urt 36

A short Pythonesque solution from this blog:

def find_ngrams(input_list, n):
  # list() makes the result a list on Python 3, where zip returns an iterator
  return list(zip(*[input_list[i:] for i in range(n)]))

Usage:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
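
To see why the one-liner works, it can help to spell out the intermediate step. The sketch below is only an expanded illustration of the same idea; the name `find_ngrams_verbose` is made up for this example and is not from the blog.

```python
def find_ngrams_verbose(input_list, n):
    # Build n shifted views of the list: the list itself, the list from
    # index 1 onward, the list from index 2 onward, and so on.
    shifted = [input_list[i:] for i in range(n)]
    # zip pairs the views up position by position and stops at the
    # shortest one, so only complete n-grams are produced.
    return list(zip(*shifted))


words = ['all', 'this', 'happened', 'more', 'or', 'less']

# The two views that get zipped together for bigrams:
print(words[0:])
print(words[1:])

print(find_ngrams_verbose(words, 2))
```

If `n` is larger than the number of words, the last view is empty and the result is an empty list.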


dav*_*off 31

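Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):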

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

grams = {}  # maps each n-gram string to its count
for g in (' '.join(x) for x in ngrams('a a a a', 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting that all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
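
The counting step can also be handed to `collections.Counter` from the standard library. The following is a minimal sketch of that variant, not the answer's own code; `count_ngrams` is a name chosen for this example. It uses `split()` with no argument, which, unlike `split(' ')`, also copes with repeated spaces and newlines.

```python
from collections import Counter


def count_ngrams(text, n):
    # Split on any run of whitespace.
    words = text.split()
    # Join each window of n consecutive words back into one string.
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    # Counter tallies how often each n-gram string occurs.
    return Counter(grams)


counts = count_ngrams('a a a a', 2)
print(counts)  # Counter({'a a': 3})
```

A `Counter` behaves like a dict, and additionally offers `most_common(k)` for pulling out the most frequent n-grams.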


Spa*_*ost 25

Use NLTK (the Natural Language Toolkit): use its functions to tokenize (split) your text into a list, and then find the bigrams and trigrams.

import nltk
words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
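
If NLTK is not available, the same two steps (tokenize, then slide a window over the tokens) can be roughly approximated with the standard library. This is only a sketch under simplifying assumptions: the regex tokenizer below is far cruder than `nltk.word_tokenize`, and the names `tokenize` and `word_ngrams` are made up for this example.

```python
import re


def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    # Punctuation such as the trailing period is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(tokens, n):
    # One tuple per window of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "Cystic fibrosis affects children and young adults in the US alone."
tokens = tokenize(sample)

print(word_ngrams(tokens, 2)[:3])
print(word_ngrams(tokens, 3)[:2])
```

Because the tokens are whole words, "cystic fibrosis" comes out as a single bigram, which is what the question asks for.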


Gun*_*jan 9

There is one more interesting module for Python called scikit-learn. The code below will help you get all the n-grams in a given range:

from sklearn.feature_extraction.text import CountVectorizer 
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))

The output is:

[u'this', u'is', u'foo', u'bar', u'sentences', u'and', u'want', u'to', u'ngramize', u'it', u'this is', u'is foo', u'foo bar', u'bar sentences', u'sentences and', u'and want', u'want to', u'to ngramize', u'ngramize it', u'this is foo', u'is foo bar', u'foo bar sentences', u'bar sentences and', u'sentences and want', u'and want to', u'want to ngramize', u'to ngramize it', u'this is foo bar', u'is foo bar sentences', u'foo bar sentences and', u'bar sentences and want', u'sentences and want to', u'and want to ngramize', u'want to ngramize it', u'this is foo bar sentences', u'is foo bar sentences and', u'foo bar sentences and want', u'bar sentences and want to', u'sentences and want to ngramize', u'and want to ngramize it', u'this is foo bar sentences and', u'is foo bar sentences and want', u'foo bar sentences and want to', u'bar sentences and want to ngramize', u'sentences and want to ngramize it']

This gives all the n-grams in the range 1 to 6. It uses the class called CountVectorizer. Here is the link for that.
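
For comparison, here is a dependency-free sketch that produces the same kind of list. It is an approximation of what the analyzer appears to do, not scikit-learn's implementation: it lowercases the text and keeps only tokens of two or more word characters, which is why "a" and "i" are missing from the output above. The function name `ngrams_in_range` is made up for this example.

```python
import re


def ngrams_in_range(text, min_n, max_n):
    # Lowercase, then keep tokens made of at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    result = []
    # Emit all unigrams first, then all bigrams, and so on up to max_n.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(' '.join(tokens[i:i + n]))
    return result


text = "this is a foo bar sentences and i want to ngramize it"
grams = ngrams_in_range(text, 1, 6)

print(len(grams))  # 45
print(grams[:3])   # ['this', 'is', 'foo']
```

With 10 tokens left after filtering, there are 10 + 9 + 8 + 7 + 6 + 5 = 45 n-grams for n from 1 to 6.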