Ger*_*ine 7 r cluster-analysis
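I am trying to run the NbClust package on my data (100 rows x 130 columns) to determine how many clusters I should choose, but whenever I apply it to the full dataset I keep getting this error: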
> nc <- NbClust(mydata, distance="euclidean", min.nc=2, max.nc=99, method="ward",
index="duda")
[1] "There are only 100 nonmissing observations out of a possible 100 observations."
Error in NbClust(mydata, distance = "euclidean", min.nc = 2, max.nc = 99, :
The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
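When I apply the method to a 100x80 matrix, it does produce the desired output (100x100 also gives me an error message, but a different one). Obviously, though, I want to apply the method to the whole dataset. FYI: creating the distance matrix and clustering with Ward's method both work without problems. Both the distance matrix and the dendrogram are...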
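I am fairly sure I have found the cause of this error message, and it is essentially data related. I looked at the source code of the NbClust package and found that the error originates near the beginning of the function: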
NbClust <- function(data, diss="NULL", distance = "euclidean", min.nc=2, max.nc=15, method = "ward", index = "all", alphaBeale = 0.1)
{
x<-0
min_nc <- min.nc
max_nc <- max.nc
jeu1 <- as.matrix(data)
numberObsBefore <- dim(jeu1)[1]
jeu <- na.omit(jeu1) # returns the object with incomplete cases removed
nn <- numberObsAfter <- dim(jeu)[1]
pp <- dim(jeu)[2]
TT <- t(jeu)%*%jeu
sizeEigenTT <- length(eigen(TT)$value)
eigenValues <- eigen(TT/(nn-1))$value
for (i in 1:sizeEigenTT)
{
if (eigenValues[i] < 0) {
print(paste("There are only", numberObsAfter,"nonmissing observations out of a possible", numberObsBefore ,"observations."))
stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
}
}
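So, in my case, my matrix produces negative eigenvalues. I double-checked this, and it does: up to roughly the 100th leading principal submatrix the eigenvalues stay positive, and after that they start to become negative. This is therefore a mathematical issue with my matrix: it is not positive definite. That matters for a number of reasons; a very good explanation of why, together with possible solutions, is given at http://www2.gsu.edu/~mkteer/npdmatri.html. I am now analyzing my data to find out what exactly causes this. So the code is fine: if you get this error message, you probably need to go back to your data.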
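One likely contributor, offered here as a sketch and not as a confirmed diagnosis of the original data: with 100 rows and 130 columns, t(jeu) %*% jeu is a 130 x 130 matrix whose rank cannot exceed 100, so at least 30 of its eigenvalues are zero in exact arithmetic, and floating-point rounding can push some of them slightly below zero. The example below uses simulated random data (the name x and the tolerance 1e-8 are assumptions made for illustration) and repeats the same computation as the NbClust excerpt above.

```r
# Simulated stand-in for a 100 x 130 data set (not the original data)
set.seed(1)
n <- 100
p <- 130
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Same quantity as TT in the NbClust excerpt: t(x) %*% x, a p x p matrix
TT <- crossprod(x)
ev <- eigen(TT / (n - 1), symmetric = TRUE, only.values = TRUE)$values

# The rank of x, and therefore of TT, is at most min(n, p) = 100
rank_x <- qr(x)$rank

# Eigenvalues that are zero up to rounding error (tolerance is an assumption)
n_near_zero <- sum(abs(ev) < 1e-8 * max(ev))

print(rank_x)
print(n_near_zero)
print(any(ev < 0))  # TRUE whenever rounding makes a "zero" eigenvalue negative
```

If the same check on your own matrix shows many eigenvalues that are tiny relative to the largest one, the negative values are more likely a consequence of having more columns than rows than of missing values.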
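I would caution you against transposing your data, because then you are effectively multiplying the transpose of the transposed data (i.e. the original data) by the transposed data. The transposed original times the original is not the same as the original times the transposed original!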
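The point about transposing can be sketched on a small simulated matrix (the names below are illustrative and not part of NbClust): t(x) %*% x is a columns-by-columns matrix, x %*% t(x) is a rows-by-rows matrix, and passing t(x) to a routine that computes the former yields the latter.

```r
# Small simulated matrix: 5 observations (rows), 3 variables (columns)
set.seed(42)
x <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)

cols_cross <- t(x) %*% x   # 3 x 3: what the NbClust excerpt computes for x
rows_cross <- x %*% t(x)   # 5 x 5: a different matrix altogether

# Passing the transposed data to the same computation gives the 5 x 5 matrix
tx <- t(x)
after_transpose <- t(tx) %*% tx

print(dim(cols_cross))
print(dim(rows_cross))
print(isTRUE(all.equal(after_transpose, rows_cross)))
```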
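I don't know what is going wrong inside the function, but you can apply the different indices and distances in a loop (to use this code, replace "base_multi_sinna" with your own data):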
library(NbClust)

# Indices to evaluate, one NbClust call per index
lista.methods = c("kl", "ch", "hartigan","mcclain", "gamma", "gplus",
                  "tau", "dunn", "sdindex", "sdbw", "cindex", "silhouette",
                  "ball","ptbiserial", "gap","frey")
# "metodo" is only the header of the first column, not a distance
lista.distance = c("metodo","euclidean", "maximum", "manhattan", "canberra")
# One row per index, one column per distance (plus the label column)
tabla = as.data.frame(matrix(ncol = length(lista.distance), nrow = length(lista.methods)))
names(tabla) = lista.distance
tabla[, 1] = lista.methods
for (j in 2:length(lista.distance)) {
  for (i in seq_along(lista.methods)) {
    nb = NbClust(base_multi_sinna, distance = lista.distance[j],
                 min.nc = 2, max.nc = 10,
                 method = "complete", index = lista.methods[i])
    # First element of Best.nc is the suggested number of clusters
    tabla[i, j] = nb$Best.nc[1]
  }
}
tabla