Related questions

n-grams in Python: four-, five-, six-grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like:

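import nltk
from nltk import bigrams

string = "I really like python, it's pretty awesome."
# bigrams() works on a sequence of tokens, so split the sentence into words first
string_bigrams = bigrams(string.split())
# recent NLTK versions return a generator; list() materializes it for printing
print(list(string_bigrams))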

I know that nltk only provides bigrams and trigrams, but is there a way to split my text into four-grams, five-grams, or even hundred-grams?

Thanks!
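One way to get n-grams of arbitrary size, sketched here without NLTK so it runs on a plain Python install: zip together n progressively shifted copies of the token list. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for illustration; NLTK's own `nltk.util.ngrams(tokens, n)` serves the same purpose when the library is available.

```python
def word_ngrams(text, n):
    """Return all word-level n-grams of `text` as a list of tuples.

    Tokenization is a plain whitespace split (an assumption for this sketch),
    so punctuation stays attached to the neighbouring word.
    """
    tokens = text.split()
    # tokens[0:], tokens[1:], ..., tokens[n-1:] zipped together yield
    # consecutive windows of length n; zip stops at the shortest slice.
    return list(zip(*(tokens[i:] for i in range(n))))


sentence = "I really like python, it's pretty awesome."
for size in (4, 5, 6):
    print(size, word_ngrams(sentence, size))
```

If `n` exceeds the number of tokens, the last slice is empty and the result is an empty list.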

python string nltk n-gram

Shi*_*ifu

2015 11-09

115 votes, 7 answers, 120k views

Fast/optimized N-gram implementations in Python

Which ngram implementation is fastest in Python?

I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

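import time
import this  # prints the Zen of Python on import; this.s holds its ROT13-encoded text
from nltk.util import ngrams as nltkngram

def zipngram(text, n=2):
    # zip n progressively shifted copies of the token list
    return zip(*[text.split()[i:] for i in range(n)])

text = this.s

# list() forces evaluation, since both calls return lazy iterators on Python 3
start = time.time()
list(nltkngram(text.split(), n=2))
print(time.time() - start)

start = time.time()
list(zipngram(text, n=2))
print(time.time() - start)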

[OUT]

0.000213146209717
6.50882720947e-05

Is there a faster implementation for generating ngrams in Python?
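A single `time.time()` measurement of a sub-millisecond call is dominated by noise, so a comparison like this is usually steadier with `timeit`, which repeats the call many times. The sketch below compares two pure-Python variants; `zip_ngrams`, `tee_ngrams`, and the sample text are names and data invented for illustration, and NLTK is left out so the block runs without extra dependencies. Timings depend on the machine and Python version.

```python
import timeit
from itertools import islice, tee


def zip_ngrams(tokens, n=2):
    # Slice-based variant: builds n shifted copies of the token list.
    return list(zip(*(tokens[i:] for i in range(n))))


def tee_ngrams(tokens, n=2):
    # Iterator-based variant: n independent iterators, the i-th advanced by i.
    iterators = tee(tokens, n)
    shifted = (islice(it, i, None) for i, it in enumerate(iterators))
    return list(zip(*shifted))


# Hypothetical sample input: 1800 whitespace-separated tokens.
tokens = ("the quick brown fox jumps over the lazy dog " * 200).split()

for fn in (zip_ngrams, tee_ngrams):
    # list() inside each function forces the work to happen while timed.
    seconds = timeit.timeit(lambda: fn(tokens, 2), number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```

Both functions return the same list of tuples, so any difference in the printed times reflects only how the windows are produced.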

python nlp information-retrieval nltk n-gram

alv*_*vas

lucky-day

11 votes, 1 answer, 3858 views

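Token pattern for n-grams in TfidfVectorizer in Python

Does TfidfVectorizer identify n-grams using Python regular expressions?

This question arose while reading the documentation for scikit-learn's TfidfVectorizer, where I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I have trouble seeing how this works. Consider the bi-gram case. If I do: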

    In [1]: import re
    In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
    Out[2]: []

I don't find any bigrams. Whereas:

    In [2]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
    Out[2]: [u'this is', u'a sentence', u'this is', u'another one']

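finds some (but not all: for example u'is a' and every other even-numbered bigram are missing). What am I doing wrong in interpreting the \b character function?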
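Two things seem to be going on here, sketched in the snippet that follows. First, in a non-raw Python string `'\b'` is the backspace character (`'\x08'`), not the regex word-boundary escape, so the first pattern looks for literal backspaces and matches nothing; a raw string (`r'...'`) passes `\b` through to `re` intact. Second, `re.findall` returns non-overlapping matches, which is why `\w+ \w*` skips every second pair. As far as I understand it, `token_pattern` only defines what a single token is, and word n-grams are assembled from consecutive tokens afterwards; the `zip` line is an illustration of that idea, not scikit-learn's actual code.

```python
import re

text = "this is a sentence! this is another one."

# In a normal (non-raw) string literal, "\b" is the backspace character.
assert "\b" == "\x08"

# Same pattern as in the question, written so that only "\b" is interpreted
# by Python: it searches for backspaces, which the text does not contain.
non_raw_pattern = "(?u)\b\\w\\w+\b"
print(re.findall(non_raw_pattern, text))

# With a raw string, \b reaches the regex engine as a word boundary.
# The pattern matches single tokens of two or more word characters.
tokens = re.findall(r"(?u)\b\w\w+\b", text)
print(tokens)

# Bigrams built from consecutive tokens (illustrative, not the library's code).
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

Note that the one-letter word "a" is dropped by the `\w\w+` pattern, so "is" and "sentence" end up adjacent in the token list.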

Note: according to the regular expression module documentation, the \b character in re is supposed to: