Triangle vs. square distance matrix for hierarchical clustering in Python?

O.r*_*rka 4 python numpy machine-learning hierarchical-clustering scipy

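I've been experimenting with hierarchical clustering, and in R it's simple: hclust(as.dist(X),method="average"). I found a method in Python that is pretty simple as well, except I'm a little confused about what happens to my input distance matrix.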

I have a similarity matrix (DF_c93tom, with a smaller test version called DF_sim) that I convert into a dissimilarity matrix with DF_dissm = 1 - DF_sim.

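I use this as input to linkage from scipy, but the documentation says it takes a square or a triangle matrix. I get a different clustering depending on whether I pass a lower triangle, an upper triangle, or a square matrix. Why is this? According to the documentation it wants the upper triangle, but the clustering from the lower triangle looks very similar.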

My question is: why are all the clusterings different, and which one is correct?

This is the documentation for the input distance matrix of linkage:

y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. 
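To make the "condensed" form in that docstring concrete, here is a minimal sketch using a made-up 4x4 distance matrix (the values are hypothetical, not taken from DF_dissm). It shows that scipy.spatial.distance.squareform flattens the strict upper triangle row by row, and that the same function converts back to the square form.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal.
D = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 4.0, 5.0],
    [2.0, 4.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
])

# Square (redundant) form -> condensed 1D form.
condensed = squareform(D)
print(condensed)  # [1. 2. 3. 4. 5. 6.]

# The condensed vector is the strict upper triangle (k=1 skips the diagonal).
upper = D[np.triu_indices(D.shape[0], k=1)]
print(np.array_equal(condensed, upper))  # True

# Applying squareform to a 1D vector restores the square matrix.
restored = squareform(condensed)
print(np.array_equal(restored, D))  # True
```

Note that the condensed form is a 1D array of length n*(n-1)/2, which is different from a 2D array with one triangle zeroed out (what np.triu and np.tril return).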

Here is my code:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

%matplotlib inline

#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10) 
#print(DF_sim)
#           0  1         2         3  4  5  6  7  8  9
# 0  1.000000  0  0.395833  0.083333  0  0  0  0  0  0
# 1  0.000000  1  0.000000  0.000000  0  0  0  0  0  0
# 2  0.395833  0  1.000000  0.883792  0  0  0  0  0  0
# 3  0.083333  0  0.883792  1.000000  0  0  0  0  0  0
# 4  0.000000  0  0.000000  0.000000  1  0  0  0  0  0
# 5  0.000000  0  0.000000  0.000000  0  1  0  0  0  0
# 6  0.000000  0  0.000000  0.000000  0  0  1  0  0  0
# 7  0.000000  0  0.000000  0.000000  0  0  0  1  0  0
# 8  0.000000  0  0.000000  0.000000  0  0  0  0  1  0
# 9  0.000000  0  0.000000  0.000000  0  0  0  0  0  1

#Dissimilarity Matrix
DF_dissm = 1 - DF_sim

#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values

#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)

fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)

fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)

plt.show()

[Image: the three dendrograms, titled "Square", "Triangle Upper", and "Triangle Lower"]

unu*_*tbu 7

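When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations, and scipy.spatial.distance.pdist is used to convert it to a sequence of pairwise distances between those observations.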
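A small sketch of that behavior, using random points invented for illustration: passing a square distance matrix D directly gives the same linkage as clustering the rows of D as if they were observations, which is not the clustering of the original points. Some SciPy versions emit a warning for the 2D call, so it is silenced here to keep the output clean.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 6 random 2D points and their square distance matrix.
rng = np.random.default_rng(0)
points = rng.random((6, 2))
D = squareform(pdist(points))  # shape (6, 6)

# 2D input: each ROW of D is treated as a 6-dimensional observation.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Z_2d = linkage(D, method="average")

# The same thing done explicitly: distances between the rows of D.
Z_rows = linkage(pdist(D), method="average")

# Intended usage: pass the condensed form of D.
Z_true = linkage(squareform(D), method="average")
Z_points = linkage(pdist(points), method="average")

print(np.allclose(Z_2d, Z_rows))      # True
print(np.allclose(Z_true, Z_points))  # True
print(np.allclose(Z_2d, Z_true))      # compare the two; they generally differ
```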

There is a GitHub issue about this behavior, because it means that passing a "distance matrix" such as DF_dissm.values (silently) produces incorrect results.
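One way to avoid the silent failure is to validate the square matrix and convert it before clustering. The sketch below is only an illustration: linkage_from_square is a hypothetical helper name, not part of SciPy, and the 3x3 matrix is made up. It relies on scipy.spatial.distance.is_valid_dm, which with throw=True raises an error for input that is not a symmetric, zero-diagonal square matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, squareform


def linkage_from_square(D, method="average"):
    """Hypothetical helper: cluster from a square distance matrix.

    Raises ValueError instead of silently treating D as observations.
    """
    D = np.asarray(D, dtype=float)
    # Checks that D is 2D, square, symmetric, and has a zero diagonal.
    is_valid_dm(D, throw=True, name="D")
    # Convert to the condensed form that linkage interprets as distances.
    return linkage(squareform(D), method=method)


# Made-up distance matrix for three items.
D = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
])

Z = linkage_from_square(D)
print(Z.shape)  # (2, 4)

# A triangular 2D array is not a valid distance matrix and is rejected.
rejected = False
try:
    linkage_from_square(np.triu(D))
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

With the data from the question, the call would look like linkage_from_square(DF_dissm.values).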

So the upshot is that none of these

Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")

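produces the desired result. Instead, pass a condensed distance matrix (or the original observations), as in this example: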


import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)

# 10 random 2D points and their full (square) pairwise distance matrix
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)

fig, ax = plt.subplots(nrows=4)

# Condensed form built by hand: the strict upper triangle, row by row
ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])

# squareform converts the square matrix to the same condensed vector
ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])

# pdist computes the condensed distances directly from the points
ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])

# A 2D array of raw observations (not distances) is also valid input
ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])

plt.show()

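[Image: the four dendrograms, titled "condensed upper triangular", "squareform", "pdist", and "sequence of observations"]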
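Applied to the data in the question, the same idea might look like the sketch below. The similarity matrix is rebuilt by hand from the 10x10 table printed in the question (an assumption, since DF_c93tom itself is not available), and plain NumPy arrays stand in for the DataFrames. With the real data, squareform(DF_dissm.values) would play the role of squareform(dissm).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity matrix reconstructed from the table printed in the question:
# identity, plus three nonzero off-diagonal pairs.
sim = np.eye(10)
for i, j, s in [(0, 2, 0.395833), (0, 3, 0.083333), (2, 3, 0.883792)]:
    sim[i, j] = sim[j, i] = s

# Dissimilarity, as in the question.
dissm = 1.0 - sim

# Convert the square matrix to condensed form before calling linkage.
Z = linkage(squareform(dissm), method="average")

# Each row of Z is (cluster_a, cluster_b, distance, size of new cluster).
# The first merge joins items 2 and 3, the most similar pair.
print(Z[0])
# The second merge joins item 0 with that new cluster (index 10).
print(Z[1])
```

The remaining items all have dissimilarity 1.0 to everything else, so they are merged last at that height.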