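I am doing hierarchical clustering with the R package pvclust, which builds on hclust and uses bootstrapping to compute significance levels for the resulting clusters.
Consider the following data set with 3 dimensions and 10 observations: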
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust on its own, the clustering runs fine with both the Euclidean and the correlation-based distance:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when I use pvclust with each of these settings, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
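...I get the following errors:
Error in hclust(distance, method = method.hclust) :
  must have n >= 2 objects to cluster
Error in cor(x, method = "pearson", use = use.cor) :
  supply both 'x' and 'y' or a matrix-like 'x'
Note that pvclust computes the distances itself, so there is no need to compute them beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why do I get these errors from pvclust with 3 dimensions or fewer, but not from hclust? And why do the errors disappear when I use a data set with 4 or more dimensions?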
At the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
Digging deeper, we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the r argument of pvclust is r=seq(.5,1.4,by=.1). Well, actually we can see that this value is being changed somewhere:
Bootstrap (r = 0.33)...
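So what we get is size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl,] ends up with only 1 row, fewer than the 2 that are needed. With r corrected, the call emits a possibly harmless warning and also returns output: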
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
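To see why a one-row resample can produce both messages from the question, here is a minimal base-R sketch that does not call pvclust at all. It assumes the Euclidean path transposes the resampled data the same way the question does with dist(t(mat)), and that the correlation path calls cor() on the resample directly; get_error is a hypothetical helper introduced only for this illustration.

```r
# Data from the question: 3 rows (dimensions) x 10 columns (observations)
mat <- as.matrix(data.frame(
  "A" = c(9000, 2, 238),  "B" = c(10000, 6, 224), "C" = c(1001, 3, 259),
  "D" = c(9580, 94, 51),  "E" = c(9328, 5, 248),  "F" = c(10000, 100, 50),
  "G" = c(1020, 2, 240),  "H" = c(1012, 3, 260),  "I" = c(1012, 3, 260),
  "J" = c(984, 98, 49)
))

n    <- nrow(mat)                          # number of rows available for resampling
size <- round(n * 0.33, digits = 0)        # same formula as in boot.hclust
smpl <- sample(1:n, size, replace = TRUE)  # a single row index

# Indexing a matrix with one row index drops it to a plain named vector
resampled <- mat[smpl, ]

# Hypothetical helper: return the error message of an expression, or NA if none
get_error <- function(expr) {
  tryCatch({ expr; NA_character_ }, error = function(e) conditionMessage(e))
}

# Correlation-based distance: cor() is handed a vector instead of a matrix
err_cor <- get_error(cor(resampled, method = "pearson"))

# Euclidean distance (ASSUMPTION: transposed as in dist(t(mat)) from the question):
# t() of a vector is a 1-row matrix, so only one object is left to cluster
err_euc <- get_error(hclust(dist(t(resampled), method = "euclidean"),
                            method = "average"))

print(is.matrix(resampled))
print(err_cor)
print(err_euc)
```

If those assumptions hold, the two printed messages should match the ones reported in the question, one for each distance setting.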
Let me know if the result is satisfactory.
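On the second part of the question (why 4 dimensions work), the same arithmetic gives a plausible explanation. The sketch below is only an illustration: resample_size is a hypothetical helper that mirrors the size line in boot.hclust, toy is made-up data, and the value r = 0.5 for the 4-row case assumes that the smallest scale stays at the default when n = 4.

```r
# Hypothetical helper mirroring `size <- round(n * r, digits = 0)` in boot.hclust
resample_size <- function(n, r) round(n * r, digits = 0)

size_n3 <- resample_size(3, 0.33)  # r as printed in the bootstrap log for 3 rows
size_n4 <- resample_size(4, 0.5)   # ASSUMPTION: smallest r is the default 0.5 for 4 rows

# Toy data (not from the question): 4 rows (dimensions) x 10 columns (observations)
toy <- matrix(seq_len(40), nrow = 4)

# Two row indices, even repeated ones, keep the matrix shape ...
keeps_shape <- is.matrix(toy[c(1, 1), ])
# ... while a single row index drops to a vector
drops_shape <- !is.matrix(toy[1, ])

print(c(size_n3 = size_n3, size_n4 = size_n4))
print(c(keeps_shape = keeps_shape, drops_shape = drops_shape))
```

Under these assumptions the smallest resample for 4 rows has 2 rows, so data[smpl, ] stays a matrix and the distance step still sees all 10 observations.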