How can I use sklearn's CountVectorizer() to get n-grams that include any punctuation as separate tokens?

Fra*_*urt 5 python nlp tokenize n-gram scikit-learn

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
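# On scikit-learn >= 1.0, use vect.get_feature_names_out(); get_feature_names() was removed in 1.2.
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))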

Output:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

The punctuation is removed. How can I include punctuation marks as separate tokens?

Fra*_*urt 8

When creating the sklearn.feature_extraction.text.CountVectorizer instance, use the tokenizer parameter to specify a word tokenizer that treats any punctuation mark as a separate token.
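For background, here is a minimal sketch of why the default setup drops punctuation. When no tokenizer is given, CountVectorizer falls back to its token_pattern parameter, whose default is the regular expression shown below, and it lowercases the text first. The snippet applies that regex with the standard re module only (no scikit-learn call), so it approximates the default tokenization rather than reproducing the library's full pipeline.

```python
import re

# Default value of CountVectorizer's token_pattern parameter:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "I really like python, it's pretty awesome."

# CountVectorizer lowercases by default before tokenizing,
# so the text is lowercased here to mimic that step.
tokens = re.findall(DEFAULT_TOKEN_PATTERN, text.lower())
print(tokens)
```

Single-character tokens and every punctuation mark are skipped by this pattern, which is why a custom tokenizer is needed.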

For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:

import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
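vect = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size),
    tokenizer=TreebankWordTokenizer().tokenize)  # keeps punctuation as tokens
vect.fit(string)  # learn the vocabulary; required before reading feature names
# On scikit-learn >= 1.0, use vect.get_feature_names_out() instead.
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))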

Output:

4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python', 
          u"it 's pretty awesome", u'like python , it', u"python , it 's", 
          u'really like python ,']
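If NLTK is not available, a regex-based tokenizer is one possible substitute. The sketch below is an assumption on my part, not part of the approach above: PUNCT_AWARE_PATTERN, punct_tokenize and word_ngrams are hypothetical names, and the n-grams are built in plain Python to keep the example dependency-free. Note that it splits "it's" into it, ' and s, whereas TreebankWordTokenizer keeps 's together, so the resulting n-grams differ from the output above.

```python
import re

# Hypothetical pattern: either a run of word characters, or any single
# character that is neither a word character nor whitespace.
PUNCT_AWARE_PATTERN = r"(?u)\w+|[^\w\s]"


def punct_tokenize(text):
    """Split text into words and individual punctuation marks."""
    return re.findall(PUNCT_AWARE_PATTERN, text)


def word_ngrams(tokens, n):
    """Join every window of n consecutive tokens with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Lowercase first, mirroring CountVectorizer's default behaviour.
tokens = punct_tokenize("I really like python, it's pretty awesome.".lower())
four_grams = word_ngrams(tokens, 4)
for gram in four_grams:
    print(gram)
```

With scikit-learn, either punct_tokenize could be passed as tokenizer=, or the pattern itself as token_pattern=, when constructing the CountVectorizer; that step is left out here so the sketch runs with the standard library alone.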