Shl*_*rtz 5 python machine-learning k-means word2vec doc2vec
I have patched together the following code from examples I found online:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans

# random
from random import shuffle

# classifier
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

sources = {'test.txt': 'DOCS'}
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())

for epoch in range(10):
    model.train(sentences.sentences_perm())

print(model.docvecs)
My test.txt file contains one paragraph per line.
The code runs fine and produces a DocvecsArray entry for each line of text.
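As far as I can tell, in the gensim version I am using the individual vectors in that DocvecsArray can be looked up by the tag that LabeledLineSentence assigned to each line, e.g.:

print(model.docvecs['DOCS_0'])  # vector for the first line of test.txt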
My goal is to get output like this:
Cluster 1: [DOC_5, DOC_100, ... DOC_N]
Cluster 2: [DOC_0, DOC_1, ... DOC_N]
I found the following answer, but its output looks like this instead:
Cluster 1: [word, word ... word]
Cluster 2: [word, word ... word]
How can I change the code so that I get clusters of documents instead?
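For reference, this is a rough sketch of the kind of clustering step I have in mind, assuming the document vectors can be looked up by the 'DOCS_%d' tags assigned above (the n_clusters value is only a placeholder):

import numpy as np
from sklearn.cluster import KMeans

# collect one vector per line of test.txt, using the tags assigned by LabeledLineSentence
tags = ['DOCS_%d' % i for i in range(len(sentences.to_array()))]
X = np.array([model.docvecs[tag] for tag in tags])

# cluster the document vectors (2 clusters is only a placeholder)
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# group document tags by their assigned cluster
clusters = {}
for tag, label in zip(tags, kmeans.labels_):
    clusters.setdefault(label, []).append(tag)

for label, docs in sorted(clusters.items()):
    print('Cluster %d: %s' % (label + 1, docs))

Is something along these lines the right way to go, or should the model itself be set up differently?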