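My first post here! I am having trouble with the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the class label and each word of the description as a feature. An example:
"My name is Obama", 001 ...
Training set = {[feature['My'] = True, feature['name'] = True, feature['is'] = True, feature['Obama'] = True], 001}
Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What is wrong with my approach? Thank you!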
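To make the intended layout concrete, here is a minimal sketch in plain Python of the training set described above. It is only an illustration: the helper name `to_labeled_featureset`, the whitespace tokenization, and the second sample are assumptions made up for the example, and the call to `nltk.NaiveBayesClassifier.train` is left as a comment so the snippet runs without NLTK installed.

```python
def document_features(words):
    """Map each distinct word to True (bag-of-words presence features)."""
    return {w: True for w in set(words)}


def to_labeled_featureset(description, code):
    # Hypothetical helper: splitting on whitespace is an assumption,
    # not necessarily how the real descriptions should be tokenized.
    return (document_features(description.split()), code)


samples = [
    ("My name is Obama", "001"),
    ("My name is Lincoln", "002"),  # made-up second sample for illustration
]

# One (feature dict, label) pair per training item.
train_set = [to_labeled_featureset(desc, code) for desc, code in samples]

# A list of (feature dict, label) pairs is the shape that
# nltk.NaiveBayesClassifier.train() takes as its training input:
# classifier = nltk.NaiveBayesClassifier.train(train_set)
```

Each description contributes only the 2 or 3 words it actually contains, so every feature dictionary stays small even when the overall vocabulary is large.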
def document_features(document):  # feature extractor
    document = set(document)
    return dict((w, True) for w in document)
...
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while (t != ""):
    t = t.split("'")
    code = t[0]  # class
    desc = t[1]  # description
    words = words.union(s)  # update dictionary with the new words in the description
    entries.append((s, code))
    ...