I'm trying to use spaCy as the tokenizer inside a larger scikit-learn pipeline, but I keep running into the problem that the task can't be pickled to be sent to the workers.

Minimal example:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from functools import partial
import spacy
def spacy_tokenize(text, nlp):
    return [x.orth_ for x in nlp(text)]
nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
tok = partial(spacy_tokenize, nlp=nlp)
pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=tok)),
                     ('clf', SGDClassifier())])
params = {'vectorize__ngram_range': [(1, 2), (1, 3)]}
CV = RandomizedSearchCV(pipeline,
                        param_distributions=params,
                        n_iter=2, cv=2, n_jobs=2,
                        scoring='accuracy')
categories = ['alt.atheism', 'comp.graphics']
news = fetch_20newsgroups(subset='train',
                          categories=categories,
                          shuffle=True,
                          random_state=42)
CV.fit(news.data, news.target)
Running this code, I get the error:
PicklingError: Could not pickle the task to send it to the workers.
What confuses me is that:
import pickle
pickle.dump(tok, open('test.pkl', 'wb'))
works without a problem.

Does anyone know whether it is possible to use spaCy together with sklearn cross-validation? Thanks!
This is not a solution, but a workaround. It looks like there is some issue between spacy and joblib:

If you save the tokenizer as a function in a separate file and then import it into your current file, you can avoid this error. Something like:
custom_file.py
import spacy
nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
def spacy_tokenizer(doc):
    return [x.orth_ for x in nlp(doc)]
main.py
#Other code
...
...
from custom_file import spacy_tokenizer
pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=spacy_tokenizer)),
                     ('clf', SGDClassifier())])
...
...
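Another possible approach, as a sketch rather than a tested fix: instead of a `partial` that captures the loaded `Language` object, wrap the tokenizer in a small callable class that stores only the model name and loads the model lazily on first use, excluding the loaded model from pickling. The class name `SpacyTokenizer` and its parameters are illustrative, not from the original post.

```python
import pickle


class SpacyTokenizer:
    """Picklable tokenizer: stores only config; the spaCy model is loaded lazily."""

    def __init__(self, model='en', disable=('ner', 'parser', 'tagger')):
        self.model = model
        self.disable = list(disable)
        self._nlp = None  # loaded on first call, never pickled

    def __getstate__(self):
        # Drop the loaded Language object so only the plain config is pickled.
        state = self.__dict__.copy()
        state['_nlp'] = None
        return state

    def __call__(self, doc):
        if self._nlp is None:
            import spacy  # deferred import: pickling never touches the model
            self._nlp = spacy.load(self.model, disable=self.disable)
        return [t.orth_ for t in self._nlp(doc)]


# The instance pickles cleanly because the model itself is excluded:
tok = SpacyTokenizer()
restored = pickle.loads(pickle.dumps(tok))
```

An instance like this can then be passed as `CountVectorizer(tokenizer=SpacyTokenizer())`; each worker process loads its own copy of the model the first time the tokenizer is called.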