How to cluster a DocvecsArray from doc2vec

Shl*_*rtz 5 python machine-learning k-means word2vec doc2vec

I've patched together the following code from examples I found on the web:

# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans

# random
from random import shuffle

# corpus reader: streams labeled sentences from the source files

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

sources = {'test.txt' : 'DOCS'}
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())

for epoch in range(10):
    # one shuffled pass per epoch; note that gensim >= 1.0 also requires
    # total_examples and epochs arguments to train()
    model.train(sentences.sentences_perm())

print(model.docvecs)

My test.txt file contains one paragraph per line.

The code runs fine and generates a DocvecsArray entry for each line of text.

My goal is to get output like this:

Cluster 1: [DOC_5, DOC_100, ..., DOC_N]
Cluster 2: [DOC_0, DOC_1, ..., DOC_N]

I found the following answer, but its output looks like this:

Cluster 1: [word, word, ..., word]
Cluster 2: [word, word, ..., word]

How do I change my code so that I get clusters of documents instead?

小智 7

So it looks like you're almost there.

You are outputting an array of vectors. To feed them to the sklearn package, you have to put them into a numpy array; numpy.asarray() is probably the best tool for that. The documentation for KMeans is excellent, and that holds across the whole library.
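For concreteness, here is a minimal sketch of that approach (not code from the original answer), assuming the older gensim API used in the question, where model.docvecs.doctags maps each 'DOCS_<n>' tag to a trained vector; the cluster count is an arbitrary placeholder:

import numpy as np
from sklearn.cluster import KMeans

# collect the document tags and their trained vectors from the model above
tags = list(model.docvecs.doctags.keys())        # e.g. ['DOCS_0', 'DOCS_1', ...]
vectors = np.asarray([model.docvecs[tag] for tag in tags])

# n_clusters=2 is a placeholder; pick a value that suits your data
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(vectors)

# group the document tags by their assigned cluster label
clusters = {}
for tag, label in zip(tags, labels):
    clusters.setdefault(label, []).append(tag)

for label, members in sorted(clusters.items()):
    print('Cluster %d: %s' % (label + 1, members))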

One note for you: I've had better luck with DBSCAN than with KMeans; both are included in the same sklearn library. DBSCAN doesn't require you to specify up front how many clusters you want in the output.
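The DBSCAN variant only changes the clustering step. This sketch reuses the tags and vectors arrays built above; the eps and min_samples values are illustrative guesses that usually need tuning, and DBSCAN labels outliers as -1:

from sklearn.cluster import DBSCAN

# eps and min_samples are illustrative; tune them for your vectors
db = DBSCAN(eps=0.5, min_samples=2)
labels = db.fit_predict(vectors)

clusters = {}
for tag, label in zip(tags, labels):
    clusters.setdefault(label, []).append(tag)

for label, members in sorted(clusters.items()):
    name = 'Noise' if label == -1 else 'Cluster %d' % (label + 1)
    print('%s: %s' % (name, members))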

The documentation pages for both have well-commented code examples.