如何在python中计算skipgrams?

sta*_*kit 20 python nlp n-gram language-model

k skipgram是ngram,它是所有ngrams和每个(ki)skipgram的超集,直到(ki)== 0(包括0跳过克).那么如何在python中有效地计算这些跳过头文件呢?

以下是我尝试的代码,但它没有按预期执行:

<pre>
    input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    def find_skipgrams(input_list, N,K):
  bigram_list = []
  nlist=[]

  K=1
  for k in range(K+1):
      for i in range(len(input_list)-1):
          if i+k+1<len(input_list):
              nlist=[]
              for j in range(N+1):
                  if i+k+j+1<len(input_list):
                    nlist.append(input_list[i+k+j+1])

          bigram_list.append(nlist)
  return bigram_list

</pre>
Run Code Online (Sandbox Code Playgroud)

上面的代码无法正确呈现,但print find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'],2,1)提供以下输出

[['this','发生','更多'],['发生','更多','或',['更多','或','更少'],['或','更少'',['less'],['发生','更多','或'],['更多','或','更少',['或','更少'],['更少'], ['减']]

此处列出的代码也没有给出正确的输出:https: //github.com/heaven00/skipgram/blob/master/skipgram.py

print skipgram_ndarray("你叫什么名字")给出:['什么,是','是,你的','你的名字','名字','什么,你的','是,名字']

名字是一个unigram!

alv*_*vas 17

从OP链接的文件,以下字符串:

叛乱分子在持续战斗中阵亡

产量:

2-skip-bi-gram = {叛乱分子被杀,叛乱分子入侵,叛乱分子正在进行,被杀,被杀,正在进行中,战斗中死亡,正在进行中,在战斗中,正在进行的战斗}

2-skip-tri-gram = {叛乱分子被杀,叛乱分子正在阵亡,叛乱分子正在战斗中,叛乱分子正在进行中,叛乱分子在战斗中,叛乱分子正在进行战斗,正在进行中阵亡,在战斗中丧生,在战斗中丧生,正在进行战斗}.

稍微修改一下NLTK的ngrams代码(https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):

from itertools import chain, combinations
import copy
from nltk.util import ngrams

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1: 
        return
    # If n+k longer than sequence, reduce k by 1 and recur
    elif nk > sequence_length: 
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng

    while nk > 1: # Collects the first instance of n+k length history
        history.append(next(sequence))
        nk -= 1

    # Iterative drop first item in history and picks up the next
    # while yielding skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)      
        # Iterates through the rest of the history and 
        # pick out all combinations the n-1grams
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skigrams for the rest of seqeunce where
    # len(sequence) < n+k
    for ng in list(skipgrams(history, n, k-1)):
        yield ng
Run Code Online (Sandbox Code Playgroud)

让我们做一些doctest来匹配论文中的例子:

>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Run Code Online (Sandbox Code Playgroud)

但要注意,如果n+k > len(sequence)它会产生与skipgrams(sequence, n, k-1)(这不是一个错误,它是一个故障安全功能)相同的效果,例如

>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams  = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams  = list(skipgrams(text, n=5, k=4))
>>>
>>> print len(three_skip_trigrams), three_skip_trigrams
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print len(three_skip_fourgrams), three_skip_fourgrams 
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fourgrams), four_skip_fourgrams 
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fivegrams), four_skip_fivegrams 
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
Run Code Online (Sandbox Code Playgroud)

这允许n == k但它不允许n > k如下所示:

if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    
Run Code Online (Sandbox Code Playgroud)

为了理解,让我们试着理解"神秘"的界限:

for idx in list(combinations(range(len(history)), n-1)):
    pass # Do something
Run Code Online (Sandbox Code Playgroud)

给定一个独特项目列表,组合产生:

>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
Run Code Online (Sandbox Code Playgroud)

并且因为令牌列表的索引总是唯一的,例如

>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> range(len(sent))
[0,1,2,3]
Run Code Online (Sandbox Code Playgroud)

可以计算范围的可能组合(无需替换):

>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
Run Code Online (Sandbox Code Playgroud)

如果我们将索引映射回标记列表:

>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
Run Code Online (Sandbox Code Playgroud)

然后我们连接current_token,我们得到当前令牌和上下文+跳过窗口的skipgrams:

>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
Run Code Online (Sandbox Code Playgroud)

所以在那之后我们继续下一个词.


alv*_*vas 9

EDITED

最新的NLTK版本3.2.5已skipgrams实现.

以下是来自NLTK回购的@jnothman的更清晰的实现:https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence of items, as an iterator.
    Skipgrams are ngrams that allows tokens to be skipped.
    Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param k: the skip distance
    :type  k: int
    :rtype: iter(tuple)
    """

    # Pads the sequence as desired by **kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
    sequence = pad_sequence(sequence, n, **kwargs)

    # Note when iterating through the ngrams, the pad_right here is not
    # the **kwargs padding, it's for the algorithm to detect the SENTINEL
    # object on the right pad to stop inner loop.
    SENTINEL = object()
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
    head = ngram[:1]
    tail = ngram[1:]
    for skip_tail in combinations(tail, n - 1):
        if skip_tail[-1] is SENTINEL:
            continue
        yield head + skip_tail
Run Code Online (Sandbox Code Playgroud)

[OUT]:

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Run Code Online (Sandbox Code Playgroud)


小智 5

虽然这完全取决于您的代码并将其推迟到外部库; 您可以使用Colibri Core(https://proycon.github.io/colibri-core)进行摘要提取.它是一个专门为从大文本语料库中提取高效n-gram和skipgram而编写的库.代码库使用C++(速度/效率),但可以使用Python绑定.

你理所当然地提到效率,因为skipgram提取很快就显示出指数复杂性,如果你只是像你一样传递一个句子,这可能不是一个大问题input_list,但是如果你在大型语料库数据上发布它会变得有问题.为了缓解这种情况,您可以设置参数(例如出现阈值),或者要求每个跳过的跳过可以至少x个不同的n-gram.

import colibricore

#Prepare corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input, one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #corpus output
classfile = "somecorpus.colibri.cls" #class encoding output
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength maximum ngram/skipgram length)
colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)

#Instantiate an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model on the encoded corpus file (this does the skipgram extraction)
model.train(corpusfile, options)

#Load a decoder so we can view the output
decoder = colibricore.ClassDecoder(classfile)

#Output all skipgrams
for pattern in model:
     if pattern.category() == colibricore.Category.SKIPGRAM:
         print(pattern.tostring(decoder))
Run Code Online (Sandbox Code Playgroud)

在网站上有关于所有这些的更广泛的Python教程.

免责声明:我是Colibri Core的作者