O.r*_*rka 4 python numpy machine-learning hierarchical-clustering scipy
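I have been experimenting with hierarchical clustering. In R it is as simple as hclust(as.dist(X), method="average"). I found a method in Python that is also fairly simple, except that I am a little confused about what happens to my input distance matrix.

I have a similarity matrix (DF_c93tom, with a smaller test version called DF_sim), which I convert to a dissimilarity matrix with DF_dissm = 1 - DF_sim.

I use it as the input to linkage from scipy, but the documentation says it takes a square or triangular matrix. I get a different clustering for each of the lower triangle, the upper triangle, and the square matrix. Why is that? According to the documentation it wants the upper triangle, yet the lower-triangle clustering looks very similar.

My question is: why are all the clusterings different, and which one is correct?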
This is the documentation for the input distance matrix of linkage:
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix.
Here is my code:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline
#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10)
#print(DF_test)
#           0  1         2         3  4  5  6  7  8  9
# 0  1.000000  0  0.395833  0.083333  0  0  0  0  0  0
# 1  0.000000  1  0.000000  0.000000  0  0  0  0  0  0
# 2  0.395833  0  1.000000  0.883792  0  0  0  0  0  0
# 3  0.083333  0  0.883792  1.000000  0  0  0  0  0  0
# 4  0.000000  0  0.000000  0.000000  1  0  0  0  0  0
# 5  0.000000  0  0.000000  0.000000  0  1  0  0  0  0
# 6  0.000000  0  0.000000  0.000000  0  0  1  0  0  0
# 7  0.000000  0  0.000000  0.000000  0  0  0  1  0  0
# 8  0.000000  0  0.000000  0.000000  0  0  0  0  1  0
# 9  0.000000  0  0.000000  0.000000  0  0  0  0  0  1
#Dissimilarity Matrix
DF_dissm = 1 - DF_sim
#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values
#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)
fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)
fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)
plt.show()
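When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations, and scipy.spatial.distance.pdist is used to convert it to a sequence of pairwise distances between the observations.

There is a GitHub issue about this behavior, because it means that passing a "distance matrix" such as DF_dissm.values (silently) produces incorrect results.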
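A minimal sketch of that behavior, using random points rather than the data from the question (the variable names here are illustrative only). It suggests that a square distance matrix is clustered as if each row were an observation, which is not the same as clustering on the distances themselves. Recent SciPy versions may emit a warning on the first call.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.random((6, 2))
square = squareform(pdist(points))  # 6x6 symmetric distance matrix

# Square input: each row is treated as a 6-dimensional observation.
Z_rows = linkage(square, method="average")
# Equivalent to running pdist on the rows of the matrix first.
Z_rows_explicit = linkage(pdist(square), method="average")
# Condensed input: clustering uses the intended pairwise distances.
Z_condensed = linkage(squareform(square), method="average")

same_as_rows = np.allclose(Z_rows, Z_rows_explicit)
heights_differ = not np.allclose(Z_rows[:, 2], Z_condensed[:, 2])
print(same_as_rows, heights_differ)
```

The merge heights differ because the row-wise Euclidean distance between rows i and j of a distance matrix is at least sqrt(2) times the original distance between points i and j.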
So the upshot is that none of these
Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
produces the desired result. Instead, use
h, w = arr.shape
Z = linkage(arr[np.triu_indices(h, 1)], method="average")
or spatial.distance.squareform:
from scipy.spatial import distance as ssd
Z = linkage(ssd.squareform(arr), method="average")
or apply spatial.distance.pdist to the original points:
Z = hierarchy.linkage(ssd.pdist(points), method="average")
or pass the 2D array points directly:
Z = hierarchy.linkage(points, method="average")
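Applied to the data in the question, the squareform route might look like the sketch below. The similarity matrix is rebuilt by hand from the table printed in the question, since DF_c93tom itself is not available; sim and dissm are stand-in names for DF_sim.values and DF_dissm.values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity matrix reconstructed from the table printed in the question.
sim = np.eye(10)
sim[0, 2] = sim[2, 0] = 0.395833
sim[0, 3] = sim[3, 0] = 0.083333
sim[2, 3] = sim[3, 2] = 0.883792

dissm = 1 - sim                # dissimilarity; the diagonal is exactly 0
condensed = squareform(dissm)  # 45 values, i.e. n * (n - 1) / 2 for n = 10
Z = linkage(condensed, method="average")
print(Z[:2])
```

With this input the first merge joins items 2 and 3 at height 1 - 0.883792 = 0.116208, and the second joins item 0 to that pair at the average of 0.604167 and 0.916667, which is 0.760417. All remaining merges happen at height 1.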
A complete example comparing the four approaches:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)
fig, ax = plt.subplots(nrows=4)
ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])
ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])
ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])
ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])
plt.show()
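The four dendrograms should look identical. As a rough numerical check that does not rely on plotting, the linkage matrices can be compared directly. This is only a sketch using the same seed and points as above; the Z_* names are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd

np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)  # square, symmetric distance matrix

Z_triu = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
Z_sq = hier.linkage(ssd.squareform(arr), method="average")
Z_pdist = hier.linkage(ssd.pdist(points), method="average")
Z_obs = hier.linkage(points, method="average")

# Compare every variant against the condensed upper-triangular result.
all_equal = all(np.allclose(Z_triu, Z) for Z in (Z_sq, Z_pdist, Z_obs))
print(all_equal)
```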