Shi*_*ifu 115 python string nltk n-gram
我正在寻找一种将文本分成n-gram的方法.通常我会做类似的事情:
import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams
我知道nltk只提供bigrams和trigrams,但有没有办法将我的文本分成4克,5克甚至100克?
谢谢!
alv*_*vas 183
由其他用户提供的基于本机python的优秀答案.但这是nltk方法(以防万一,OP因重新发明nltk库中已有的内容而受到惩罚).
有一个人们很少使用的ngram模块nltk.这不是因为它很难读取ngrams,而是训练基于ngrams的模型,其中n> 3将导致大量数据稀疏性.
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
  print grams
ins*_*get 58
我很惊讶这还没有出现:
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]
In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
M.A*_*san 13
这是做n-gram的另一种简单方法
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
Δημ*_*πάς 11
仅使用nltk工具
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]
示例输出
get_ngrams('This is the simplest text i could think of', 3 )
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
为了保持数组格式的ngrams只需删除 ' '.join  
您可以使用以下命令轻松启动自己的功能itertools:
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
使用 python 的 builtin 构建二元组的更优雅的方法zip()。只需将原始字符串转换为列表split(),然后正常传递列表一次,并偏移一个元素。
string = "I really like python, it's pretty awesome."
def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])
def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])
find_bigrams(string)
[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
人们已经对需要二元组或三元组的情况已经很好地回答了,但是在这种情况下,如果您需要句子的每一格,都可以使用nltk.util.everygrams 
>>> from nltk.util import everygrams
>>> message = "who let the dogs out"
>>> msg_split = message.split()
>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]
如果您有一个限制(如三字母组),最大长度应为3,则可以使用max_len参数来指定它。
>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]
您可以修改max_len参数以实现任意克,即4克,5克,6甚至100克。
可以修改前面提到的解决方案以实现上面提到的解决方案,但是此解决方案比这要简单得多。
欲了解更多信息,请点击这里
而且,当您只需要一个特定的语法,例如bigram或trigram等时,可以使用MAHassan的答案中提到的nltk.util.ngrams。