I'm trying to get started with word2vec and doc2vec, using the excellent tutorials here and here, and attempting to use their code samples. The only thing I added is a clean_line() method to remove punctuation, stopwords, etc.
But I'm running into trouble with the clean_line() method that gets called during the training iterations. I understand that the call to the global method is messing things up, but I'm not sure how to fix the problem.
Iteration 1
Traceback (most recent call last):
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 96, in <module>
    train()
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 91, in train
    model.train(sentences.sentences_perm(),total_examples=model.corpus_count,epochs=model.iter)
  File "/Users/santino/Dev/doc2vec_exp/doc2vec_exp_app/doc2vec/untitled.py", line 61, in sentences_perm
    shuffled = list(self.sentences)
AttributeError: 'TaggedLineSentence' object has no attribute 'sentences'
My code is below:
import gensim
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
import os
import random
import numpy
from sklearn.linear_model import LogisticRegression
import logging
import sys
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

def clean_line(line):
    new_str = unicode(line, errors='replace').lower()  # encoding issues
    dlist = tokenizer.tokenize(new_str)
    dlist = list(set(dlist).difference(stopword_set))
    new_line = ' '.join(dlist)
    return new_line

class TaggedLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(TaggedDocument(utils.to_unicode(clean_line(line)).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled

def train():
    # collect every .csv file under ./data/, keyed by path, with the bare
    # file name (minus extension) used as the tag prefix
    doc_files = [f for f in os.listdir('./data/') if f.endswith('.csv')]
    sources = {}
    for doc in doc_files:
        doc2 = os.path.join('./data', doc)
        sources[doc2] = doc.replace('.csv', '')

    # iterator returned over all documents
    sentences = TaggedLineSentence(sources)

    model = gensim.models.Doc2Vec(size=300, min_count=2, alpha=0.025, min_alpha=0.025)
    model.build_vocab(sentences)

    # training of model
    for epoch in range(10):
        #random.shuffle(sentences)
        print 'iteration ' + str(epoch + 1)
        #model.train(it)
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        model.train(sentences.sentences_perm(), total_examples=model.corpus_count, epochs=model.iter)

    # saving the created model
    model.save('reddit.doc2vec')
    print "model saved"

train()
Those aren't great tutorials for the latest versions of gensim. In particular, calling train() multiple times in a loop, with your own manual management of alpha/min_alpha, is a bad idea. It's easy to mess up (indeed, the wrong things happen in your code here), and it offers no benefit for most users. Don't change min_alpha from its default, and call train() exactly once: it will then perform epochs iterations, smoothly decaying the effective learning rate alpha from its maximum to its minimum value.
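For example, here is a minimal sketch of that single-call pattern, assuming a gensim 2.x/3.x-era API matching the question's use of corpus_count and model.iter (newer releases rename size to vector_size and iter to epochs):

# one train() call; gensim decays alpha from its max to its min internally
model = Doc2Vec(size=300, min_count=2)   # leave alpha/min_alpha at their defaults
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)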
Your specific error is that your TaggedLineSentence class has no sentences attribute, at least not until to_array() has been called, yet the code tries to access that not-yet-existent attribute.
The whole to_array()/sentences_perm() approach is a bit broken. The usual reason for adopting such an iterable class is to keep a large dataset out of main memory, streaming it from disk instead. But to_array() then just loads everything anyway, caching it inside the class, which eliminates the benefit of the iterable. If you can afford that, because the full dataset easily fits in memory, you can simply do...
sentences = list(TaggedLineSentence(sources))
...to iterate over everything from disk once, then keep the corpus in an in-memory list.
Repeated shuffling during training usually isn't needed either. Only if the training data has some pre-existing clumping, such as all the examples with certain words/topics stuck together at the top or bottom of the ordering, is the native ordering likely to cause training problems. And in that case, a single shuffle before any training should be enough to remove the clumping. So again, assuming your data fits in memory, you can just do...
sentences = list(TaggedLineSentence(sources))
random.shuffle(sentences)  # shuffles in place; note random.shuffle() returns None
...once, and then you have a sentences list that you can pass to your Doc2Vec model below, to both build_vocab() and train() (each called just once).
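Putting it all together, the question's train() could be restructured along the lines of this sketch (same API assumptions as above; paths and parameter values carried over from the question):

def train():
    doc_files = [f for f in os.listdir('./data/') if f.endswith('.csv')]
    sources = {}
    for doc in doc_files:
        sources[os.path.join('./data', doc)] = doc.replace('.csv', '')

    # stream from disk once, shuffle once before any training
    sentences = list(TaggedLineSentence(sources))
    random.shuffle(sentences)

    model = Doc2Vec(size=300, min_count=2)   # default alpha/min_alpha
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.save('reddit.doc2vec')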