Python - Compute hierarchical clustering of word2vec vectors and plot the result as a dendrogram

Shl*_*rtz 6 python numpy machine-learning hierarchical-clustering word2vec

I generated a 100-dimensional word2vec model from my domain text corpus, merging common phrases into single tokens (good bye => good_bye). I then extracted the vectors for the 1000 words I need.

So I have a numpy array of 1000 vectors, like this:

[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
 [-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
 ...
 ...[1000 Vectors]
]

and an array of words like this:

["hello","hi","bye","good_bye"...1000]
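
For context, a minimal sketch of how arrays like these could be produced with gensim; the tiny corpus, the Phrases step, and the word list below are placeholders, not the asker's actual pipeline:

# Hypothetical sketch (gensim >= 4): train a 100D word2vec model with bigram
# merging and extract vectors for a fixed word list. Corpus and word list are
# placeholders, not the asker's actual data.
import numpy as np
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

corpus = [["good", "bye"],
          ["good", "bye", "friend"],
          ["hello", "there"]]                    # tokenized sentences (placeholder)

# with enough co-occurrences, "good bye" is merged into "good_bye"
bigram = Phraser(Phrases(corpus, min_count=1, threshold=1))
model = Word2Vec(bigram[corpus], vector_size=100, min_count=1)

desired = ["hello", "good_bye"]                  # the words to keep (placeholder)
words = [w for w in desired if w in model.wv]
words_vectors = np.array([model.wv[w] for w in words])   # shape (len(words), 100)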

I ran K-Means on my data and the results I got made sense:

import numpy as np
from sklearn.cluster import KMeans

X = np.array(words_vectors)              # 1000 x 100 matrix of word vectors
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx, l in enumerate(kmeans.labels_):
    print(l, words[idx])

--- Output ---
0 hello
0 hi
1 bye
1 good_bye

0 = greetings, 1 = farewells

However, some of the words make me think that hierarchical clustering is a better fit for this task. I tried AgglomerativeClustering, but unfortunately things got complicated for this Python noob and I got lost.
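
For reference, a minimal sketch of what such an AgglomerativeClustering attempt could look like, reusing X and words from the K-Means snippet above; n_clusters=20 and Ward linkage are illustrative choices, and note that it returns flat labels rather than a dendrogram, which is where the question below comes in:

# Hypothetical sketch: agglomerative (hierarchical) clustering with scikit-learn.
# It yields flat cluster labels like K-Means, but no dendrogram by itself.
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=20, linkage='ward')
labels = agg.fit_predict(X)          # X: the 1000 x 100 word-vector matrix
for idx, l in enumerate(labels):
    print(l, words[idx])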

How can I cluster my vectors so that the output is a dendrogram, more or less like the one found on this wiki page?

[example dendrogram image from the wiki page]

Ant*_*and 8

I had the same problem until now! Searching online (keywords = hierarchical clustering on word2vec) kept turning up your post, so I have to give you a solution that may work.

sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]

import gensim
model = gensim.models.Word2Vec(sentences_split, min_count=2)

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# model.wv.vectors replaces the deprecated model.wv.syn0 in recent gensim
l = linkage(model.wv.vectors, method='complete', metric='seuclidean')

# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    leaf_rotation=90.,   # rotates the leaf labels
    leaf_font_size=16.,  # font size for the leaf labels
    orientation='left',  # root on the left, words along the y axis
    leaf_label_func=lambda v: str(model.wv.index_to_key[v])  # index_to_key replaces the deprecated index2word
)
plt.show()
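
Applied to the asker's 1000-word case, the same recipe might look like the sketch below; words and words_vectors are the arrays from the question, while the Ward linkage, figure size, and the 20-cluster cut are illustrative assumptions:

# Hypothetical sketch: the same linkage/dendrogram recipe on the asker's
# 1000 x 100 matrix, labelling leaves with the asker's own word list.
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

X = np.array(words_vectors)                      # 1000 x 100
Z = linkage(X, method='ward')                    # Ward linkage on Euclidean distance

plt.figure(figsize=(10, 120))                    # very tall figure: 1000 leaves
dendrogram(Z, orientation='left', labels=words, leaf_font_size=6)
plt.xlabel('distance')
plt.tight_layout()
plt.savefig('dendrogram.png')

# Flat clusters can also be cut from the same linkage matrix,
# giving labels comparable to the K-Means output:
flat = fcluster(Z, t=20, criterion='maxclust')   # about 20 clusters
for label, word in zip(flat, words):
    print(label, word)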