Sklearn: adding a lemmatizer to CountVectorizer

Ren*_*ens 3 python lemmatization scikit-learn countvectorizer

I added lemmatization to my CountVectorizer, as explained on the sklearn documentation page:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                       strip_accents = 'unicode',
                       stop_words = 'english',
                       lowercase = True,
                       token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                       max_df = 0.5,
                       min_df = 10)

However, when creating the DTM with fit_transform, I get an error (which I also couldn't make much sense of). Before I added the lemmatization to my vectorizer, the DTM code always worked. I dug into the manual and experimented with the code, but couldn't find a solution.

dtm_tf = tf_vectorizer.fit_transform(articles)

Update:

Following @MaxU's suggestion below, the code runs without errors, but digits and punctuation are not omitted from my output. I ran separate tests to see which of the features do and do not work with LemmaTokenizer(). The result:

strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works

Apparently it is only token_pattern that becomes inactive: CountVectorizer ignores token_pattern whenever a custom tokenizer is supplied. Here is the updated, working code without token_pattern (I just had to install the 'punkt' and 'wordnet' packages first):

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', # works 
                                stop_words = 'english', # works
                                lowercase = True, # works
                                max_df = 0.5, # works
                                min_df = 10) # works
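Since token_pattern is ignored once a custom tokenizer is supplied, one alternative (a sketch of my own, not part of the original setup; the class name and filter are assumptions) is to do the 3-or-more-letter filtering inside the tokenizer itself:

import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class FilteringLemmaTokenizer(object):
    # Hypothetical variant: keep only purely alphabetic tokens of
    # 3 or more characters (what token_pattern used to do), then lemmatize.
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)
                if re.fullmatch(r'[a-zA-Z]{3,}', t)]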

For those who want to remove digits, punctuation, and words of fewer than 3 characters (but don't know how), here is a way that works for me when working from a Pandas dataframe:

# when working from Pandas dataframe

df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)          # for digits
df['TEXT'] = df['TEXT'].str.replace(r'(\b\w{1,2}\b)', '', regex=True) # for 1-2 letter words
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)       # for punctuation
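To tie it together, a minimal end-to-end sketch with toy data of my own (assumes the LemmaTokenizer class defined above, and scikit-learn >= 1.0 for get_feature_names_out; max_df/min_df are dropped so the tiny corpus survives):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT': ['The 2 cats sat on 42 mats!',
                            'A dog chases 3 cats.']})
df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)
df['TEXT'] = df['TEXT'].str.replace(r'(\b\w{1,2}\b)', '', regex=True)
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())  # note the ()
dtm_tf = tf_vectorizer.fit_transform(df['TEXT'])
print(tf_vectorizer.get_feature_names_out())  # lemmatized vocabulary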

Max*_*axU 6

It should be:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE:                        ---------------------->  ^^

instead of:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
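The tokenizer parameter expects a callable that maps a document string to a list of tokens. Passing the class makes the vectorizer call the constructor on every document, which fails; passing an instance makes it call the instance's __call__, which does the lemmatizing tokenization. A quick illustration:

tok = LemmaTokenizer()           # an instance is a callable that tokenizes
tok('The cats sat on the mats')  # -> ['The', 'cat', 'sat', 'on', 'the', 'mat']
LemmaTokenizer('some document')  # TypeError: __init__() takes no document argument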