use*_*903 14 python tf-idf python-3.x scikit-learn joblib
I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection.
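from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)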
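Now I want to store this and use it in other programs. I don't want to re-run TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib, but I wonder whether this works the same way as making a model persistent.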
Mar*_*ina 12
You can simply use the built-in pickle library:
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("selector.pickle", "wb") as f:
    pickle.dump(selector, f)
And to load them:
vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
selector = pickle.load(open("selector.pickle", "rb"))
Pickle serializes the object to disk and loads it back into memory when needed.
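To check that the reloaded objects behave like the originals, a round trip on toy data can be sketched as below. The corpus, labels, `k=5`, and the temporary directory are hypothetical stand-ins for the real training setup, not part of the question.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data standing in for the real corpus and labels.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "the market rallied after the report",
]
y_train = [0, 0, 1, 1]

# Fit once, exactly as in the question (k=5 only because the toy vocabulary is small).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)

# Write both fitted objects to a temporary directory.
workdir = tempfile.mkdtemp()
vec_path = os.path.join(workdir, "vectorizer.pickle")
sel_path = os.path.join(workdir, "selector.pickle")
with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(sel_path, "wb") as f:
    pickle.dump(selector, f)

# Load them back, as another program would.
with open(vec_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(sel_path, "rb") as f:
    loaded_selector = pickle.load(f)

# The reloaded pair should transform new text the same way as the fitted pair.
new_docs = ["the cat chased the market"]
before = selector.transform(vectorizer.transform(new_docs))
after = loaded_selector.transform(loaded_vectorizer.transform(new_docs))
same_output = bool(np.allclose(before.toarray(), after.toarray()))
print(same_output)
```

Note that only `transform` is called after loading; nothing is refit.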
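"Making an object persistent" basically means dumping the binary representation of the object, which lives in memory, to a file on the hard drive, so that later, in the same program or in any other program, the object can be reloaded from that file into memory.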
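As a small illustration of that idea, the standard library can produce the byte string without touching the disk at all. The `state` dict here is a hypothetical stand-in for a fitted object.

```python
import pickle

# Hypothetical stand-in for an object that lives in memory.
state = {"vocabulary": {"hello": 0, "world": 1}, "k": 5000}

# Serialize to bytes; writing these bytes to a file is what makes the object persistent.
payload = pickle.dumps(state)

# Deserialize: an equal but separate object is rebuilt from the bytes.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == state, restored is state)
```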
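Either joblib, which scikit-learn bundles, or the stdlib pickle and cPickle will do the job. I tend to prefer cPickle because it is faster (cPickle exists only in Python 2; in Python 3, pickle uses the C implementation automatically). Using IPython's %timeit command:
>>> import pickle, cPickle  # cPickle exists only on Python 2
>>> import joblib  # older scikit-learn: from sklearn.externals import joblib
>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
# the second argument is y, which TfidfVectorizer ignores
>>> t.fit_transform(['hello world'], ['this is a test'])
# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:     with open('vectorizer.bin', 'wb') as f:  # binary mode is required for pickles
...:         serializer.dump(tfidf, f)
...:     with open('vectorizer.bin', 'rb') as f:
...:         return serializer.load(f)
# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:     joblib.dump(tfidf, 'tfidf.bin')
...:     return joblib.load('tfidf.bin')
# Now, time it!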
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop
>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop
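%timeit is IPython-only, and cPickle is gone in Python 3, so here is a rough sketch of the same comparison with the stdlib timeit module. The file names, the run count of 50, and the two-sentence corpus are arbitrary assumptions; the numbers will vary by machine and library version, so none are quoted here.

```python
import os
import pickle
import tempfile
import timeit

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny fitted vectorizer to serialize.
vectorizer = TfidfVectorizer()
vectorizer.fit(["hello world", "this is a test"])

workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "vectorizer.bin")
joblib_path = os.path.join(workdir, "tfidf.bin")


def pickle_round_trip(obj):
    """Dump obj with pickle and load it straight back."""
    with open(pickle_path, "wb") as f:
        pickle.dump(obj, f)
    with open(pickle_path, "rb") as f:
        return pickle.load(f)


def joblib_round_trip(obj):
    """Dump obj with joblib and load it straight back."""
    joblib.dump(obj, joblib_path)
    return joblib.load(joblib_path)


runs = 50
pickle_seconds = timeit.timeit(lambda: pickle_round_trip(vectorizer), number=runs)
joblib_seconds = timeit.timeit(lambda: joblib_round_trip(vectorizer), number=runs)
print("pickle: %.3f ms per round trip" % (1000 * pickle_seconds / runs))
print("joblib: %.3f ms per round trip" % (1000 * joblib_seconds / runs))
```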
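Now, if you want to store multiple objects in a single file, you can easily create a data structure to hold them and then dump the data structure itself. This works with a tuple, list, or dict. Using the example from your question: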
import cPickle  # Python 2 only; on Python 3 use: import pickle as cPickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)
# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f:
    cPickle.dump(data_struct, f)
Later, or in another program, the following statements bring the data structure back into the program's memory:
import cPickle  # Python 2 only; on Python 3 use: import pickle as cPickle

# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
vectorizer, selector = data_struct['vectorizer'], data_struct['selector']
# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)
Here is my answer using joblib:
import joblib

joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')
Later, I can load them and I'm ready to go:
import joblib

vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')
test = selector.transform(vectorizer.transform(['this is test']))
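A possible variation, not something the answers here use: wrap both steps in a scikit-learn Pipeline so there is only one object to dump and load. This is a sketch under assumptions; the step names, toy corpus, `k=5`, and file name are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real corpus and labels.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "the market rallied after the report",
]
y_train = [0, 0, 1, 1]

# One object holding both the vectorizer and the selector.
preprocess = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("selector", SelectKBest(chi2, k=5)),  # k=5 only because the toy vocabulary is small
])
X_train_sel = preprocess.fit_transform(corpus, y_train)

# Persist the fitted pipeline as a single file.
path = os.path.join(tempfile.mkdtemp(), "preprocess.pkl")
joblib.dump(preprocess, path)

# Later, or in another program: load it and transform new text without refitting.
restored = joblib.load(path)
test = restored.transform(["the cat sat"])
print(test.shape)
```

The trade-off is that the two steps can no longer be swapped independently on disk, but they also cannot get out of sync.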