Token pattern for n-grams in TfidfVectorizer in Python

Question

nik*_*osd 6 python regex n-gram scikit-learn

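Does TfidfVectorizer identify n-grams using Python regular expressions?

This question came up while reading the scikit-learn documentation for TfidfVectorizer. I see that the pattern for recognizing n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bigram case. If I do: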

    In [1]: import re
    In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
    Out[2]: []

I do not find any bigrams. Whereas:

    In [2]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
    Out[2]: [u'this is', u'a sentence', u'this is', u'another one']

finds some (but not all; for example u'is a' and every other bigram that starts at an even-numbered word are missing). What am I doing wrong in interpreting the function of the \b character?

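Note: According to the regular expression module documentation, the \b character in re is supposed to:

\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.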

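I have seen questions that address the problem of identifying n-grams in Python (see 1, 2), so a secondary question is: should I do that and add the joined n-grams myself before feeding my text to TfidfVectorizer?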

Answer 1

ely*_*ase 1

You should prefix the regular expression with r. The following works:

>>> re.findall(r'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
[u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']
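
The difference comes from how Python reads the string literal before the re module ever sees it. The sketch below is only an illustration under a small assumption: the non-raw pattern spells `\w` as `\\w`, which yields the same string as the pattern in the question but avoids the invalid-escape warning that newer Python versions emit.

```python
import re

text = 'this is a sentence! this is another one.'

# In a normal string literal, '\b' is the backspace character (U+0008).
assert '\b' == '\x08' and len('\b') == 1
# In a raw literal, r'\b' is two characters: a backslash followed by 'b'.
assert len(r'\b') == 2

# Same string as the non-raw pattern in the question: '\b' becomes a
# backspace, while '\\w' stays as backslash + 'w'.
non_raw = '(?u)\b\\w\\w+\b'
raw = r'(?u)\b\w\w+\b'

# The non-raw pattern searches for literal backspace characters around
# the word, and the text contains none.
no_matches = re.findall(non_raw, text)
# The raw pattern uses \b as a word boundary, as intended.
tokens = re.findall(raw, text)

print(no_matches)
print(tokens)
```

Note that `a` is absent from the second result because `\w\w+` requires at least two word characters.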

This is a known bug in the documentation, but if you look at the source code, they do use raw literals.
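
On the secondary question: token_pattern only describes a single token, so it is not expected to match bigrams on its own. In scikit-learn, word n-grams are requested through the ngram_range parameter and assembled from consecutive tokens after tokenization. The following is a rough standard-library sketch of that idea, not scikit-learn's actual code; `word_ngrams` is a hypothetical helper named here for illustration, and lowercasing is assumed to mirror the vectorizer's default.

```python
import re

def word_ngrams(text, n=2, token_pattern=r'(?u)\b\w\w+\b'):
    """Tokenize with the regex, then join each run of n consecutive tokens.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    tokens = re.findall(token_pattern, text.lower())
    # Slide a window of size n over the token list.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams('this is a sentence! this is another one.')
print(bigrams)
```

Because one-character tokens such as `a` are dropped before the n-grams are built, the result contains `is sentence` rather than `is a`. With this in mind, passing something like `ngram_range=(1, 2)` to TfidfVectorizer should make joining n-grams by hand unnecessary.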