spaCy and scikit-learn vectorizer

Question


tkj*_*kja 5 python nlp scikit-learn spacy

I wrote a lemma tokenizer for scikit-learn using spaCy, based on their example, and it works fine on its own:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}


However, using it with GridSearchCV produces an error. Here is a self-contained example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'


The error does not occur when I load spaCy outside the tokenizer's constructor, and GridSearchCV then runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc


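However, this means that every one of my n_jobs from GridSearchCV will access and call the same spacynlp object, which is shared among those jobs. That raises two questions: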

Is the spacynlp object returned by spacy.load('en') safe to use from multiple jobs in GridSearchCV?
Is this the correct way to implement calls to spaCy inside a tokenizer for scikit-learn?

Answer 1

mba*_*rov 0

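Running spaCy for every parameter setting in the grid wastes time, and the memory overhead is large as well. You should run all of your data through spaCy once and save the result to disk, then use a simplified vectorizer that reads the pre-lemmatized data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow showing how to build a custom vectorizer.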
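A minimal sketch of that idea, under a few assumptions: toy_lemmatize and TOY_LEMMAS are hypothetical stand-ins for the spaCy step (so the snippet runs without downloading a model), the cache file name lemmas.json is arbitrary, and identity is a helper introduced here. It is a module-level function rather than a lambda so that it can be pickled when GridSearchCV runs with n_jobs > 1.

```python
import json
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the spaCy step. With a loaded pipeline `nlp`,
# the equivalent would be roughly:
#     [tok.lemma_ for tok in nlp(text) if tok.lemma_.isalnum()]
TOY_LEMMAS = {"apples": "apple", "oranges": "orange", "are": "be"}


def toy_lemmatize(text):
    """Lowercase, strip trailing punctuation, and map words to toy lemmas."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [TOY_LEMMAS.get(w, w) for w in words if w]


def identity(tokens):
    """Return an already-tokenized document unchanged."""
    return tokens


docs = ["Apples and oranges are tasty.", "Oranges are round."]

# Step 1: lemmatize once, outside the grid search, and cache the result.
cache_path = os.path.join(tempfile.gettempdir(), "lemmas.json")
with open(cache_path, "w", encoding="utf-8") as fh:
    json.dump([toy_lemmatize(d) for d in docs], fh)

# Step 2: load the cache; each document is now a list of lemma strings.
with open(cache_path, encoding="utf-8") as fh:
    lemma_docs = json.load(fh)

# Step 3: a callable analyzer receives each document as-is, so the
# vectorizer skips its own preprocessing and tokenization entirely.
vect = TfidfVectorizer(analyzer=identity)
vect.fit(lemma_docs)
print(sorted(vect.vocabulary_))
```

In a pipeline, X would then be lemma_docs rather than the raw strings, and parameters such as vect__min_df and vect__max_df can still be searched. Options that act on raw text (strip_accents, for example) are bypassed when analyzer is a callable, so any such cleanup would have to happen in the one-off lemmatization step.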

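"You are wasting time", "there are plenty of examples". This answer is not that helpful. (13 upvotes)
"There are plenty of examples on Stack Overflow showing how to build a custom vectorizer": linking to at least one of those examples would be helpful. (4 upvotes)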

Archived:	8 years, 9 months ago
Views:	2884
Last activity:	8 years, 9 months ago