Which ngram implementation is fastest in Python?
I am trying to profile nltk's ngrams against Scott's zip-based version (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):
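from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text, n=2):
    # Zip n shifted copies of the token list to form the n-grams.
    return zip(*[text.split()[i:] for i in range(n)])

text = this.s

# list() forces evaluation: on Python 3 both calls return lazy iterators.
start = time.time()
list(nltkngram(text.split(), n=2))
print(time.time() - start)

start = time.time()
list(zipngram(text, n=2))
print(time.time() - start)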
[OUT]
0.000213146209717
6.50882720947e-05
Is there a faster way to generate ngrams in Python?
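A single time.time() measurement of a call this short is quite noisy, so a repeated timeit run is probably a fairer comparison. The following is only a sketch under stated assumptions: zip_ngrams mirrors the zip approach from the question, while islice_ngrams is a hypothetical alternative added for comparison (it is not from the linked post), and the sample tokens are made up. It makes no claim about which variant wins; that depends on the input and the interpreter.

```python
import timeit
from itertools import islice


def zip_ngrams(tokens, n=2):
    # Zip n shifted slices of the token list (the approach from the question).
    return list(zip(*[tokens[i:] for i in range(n)]))


def islice_ngrams(tokens, n=2):
    # Hypothetical variant: shifted islice iterators avoid copying the slices.
    return list(zip(*[islice(tokens, i, None) for i in range(n)]))


# Made-up sample input, large enough for the timing to be measurable.
tokens = ("the quick brown fox jumps over the lazy dog " * 200).split()

for fn in (zip_ngrams, islice_ngrams):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(tokens, 2), number=100, repeat=5))
    print(fn.__name__, best)
```

Both functions return a fully built list, so the timing covers the actual work rather than just the creation of a lazy iterator.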

The following word2ngrams function extracts character 3-grams from a word:
>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']
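This post shows character ngram extraction for a single word: Fast implementation of character n-grams using Python.
But what if I have a sentence and want to extract character ngrams from it? How should word2ngram() be applied?
What would a regex version that produces the same output as word2ngram and sent2ngram look like? Would it be faster?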
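One possible regex formulation, offered as a sketch rather than a definitive answer: a lookahead with a capturing group matches at every position without consuming characters, so the captured ngrams overlap. The name regex_char_ngrams is hypothetical. Using \S keeps ngrams from crossing whitespace, which is assumed here to correspond to splitting the sentence into words first. Whether it is faster would have to be measured.

```python
import re


def regex_char_ngrams(text, n=3):
    # (?=(\S{n})) is a zero-width lookahead, so successive matches overlap;
    # \S excludes whitespace, so no ngram spans two words.
    pattern = re.compile(r"(?=(\S{%d}))" % n)
    return pattern.findall(text.lower())


print(regex_char_ngrams("foobar"))
# ['foo', 'oob', 'oba', 'bar']
print(regex_char_ngrams("Foobar baz"))
# ['foo', 'oob', 'oba', 'bar', 'baz']
```

Words shorter than n contribute nothing, which is the same as what the slicing-based list comprehension returns for them.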
I tried:
import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if …

What is a simple way to do this using zip:
Input: (1,2,3,4,5)
Output: ((1,2),(2,3),(3,4),(4,5))
Edit: Yes, the general ngram solution is similar, but it is too verbose for such a simple task. See the answer below for why.
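For the specific input and output shown above, one minimal sketch is to zip the sequence with a copy of itself shifted by one. The helper name pairwise is an assumption made for illustration, and the sketch assumes the input supports slicing (a tuple or list, not a generic iterator).

```python
def pairwise(seq):
    # zip stops at the shorter argument, so the last element is never
    # paired with anything past the end of the sequence.
    return tuple(zip(seq, seq[1:]))


print(pairwise((1, 2, 3, 4, 5)))
# ((1, 2), (2, 3), (3, 4), (4, 5))
```

This is the n=2 case of the general zip-based ngram approach, with the list of shifted slices written out by hand.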