Nic*_*ani 2 python nlp nltk text-classification textblob
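I'm trying to build a text classification model with Python and textblob. The script runs on my server, and the plan is that users will eventually be able to submit their own text and have it classified. I load the training set from a CSV file: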
# -*- coding: utf-8 -*-
import sys
import codecs
from nltk.tokenize import word_tokenize
from textblob.classifiers import NaiveBayesClassifier

# Redirect everything printed below into a text file.
sys.stdout = open('yyyyyyyyy.txt', "w")

# Train on the CSV (one "text",label pair per row); this is the slow step.
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

print(cl.classify("some text"))
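The CSV is about 500 lines long (each string is between 10 and 100 characters). NaiveBayesClassifier takes about 2 minutes to train before it can classify my text. I'm not sure whether it is normal for it to take this long; maybe my server is just slow, since it has only 512 MB of RAM.
Example of a CSV line: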
"Oggi alla Camera con la Fondazione Italia-Usa abbiamo consegnato a 140 studenti laureati con 110 e 110 lode i diplomi del Master in Marketing Comunicazione e Made in Italy.",FI-PDL
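What I can't work out, and couldn't find in the textblob documentation, is whether there is a way to "save" my trained classifier. That would save a lot of time, because right now the classifier is retrained every time I run the script. I'm new to text classification and machine learning, so I apologize if this is a silly question.
Thanks in advance.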
OK, I found that the pickle module is what I needed :)
Training:
# -*- coding: utf-8 -*-
import pickle
from textblob.classifiers import NaiveBayesClassifier

# Train once on the CSV.
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

# Save the trained classifier to disk; the with-block closes the file.
with open('classifier.pickle', 'wb') as f:
    pickle.dump(cl, f)
Loading:
import pickle
import sys
from textblob.classifiers import NaiveBayesClassifier

# Redirect printed output to a text file.
sys.stdout = open('demo.txt', "w")

# Load the classifier saved by the training script; no retraining needed.
with open('classifier.pickle', 'rb') as f:
    cl = pickle.load(f)
print(cl.classify("text to classify"))
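If you would rather keep training and loading in a single script, one possible pattern is "load the pickle if it exists, otherwise train and save it". The sketch below is only an illustration: load_or_train, train_stub and the temporary path are hypothetical names that are not part of textblob, and a plain dict stands in for the trained classifier so the example runs without textblob installed.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Return the object pickled at `path`, or call `train_fn` and cache its result."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    model = train_fn()
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return model


# Hypothetical stand-in for the slow training step. With textblob this
# function would open file.csv and return NaiveBayesClassifier(fp, format="csv").
calls = []


def train_stub():
    calls.append(1)  # record each time "training" actually runs
    return {"label": "FI-PDL"}


cache_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")
first = load_or_train(cache_path, train_stub)   # trains, then writes the file
second = load_or_train(cache_path, train_stub)  # reads the file, no training
```

Delete the pickle file whenever the CSV changes so the next run retrains. Also keep in mind that pickle.load can execute arbitrary code, so only load files your own script wrote.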
Run Code Online (Sandbox Code Playgroud)