Cutting a dendrogram into n trees with a minimum cluster size in R

Bry*_*yan 6 r distance hierarchical-clustering

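I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups of 100 members or fewer, with no group holding more than 40% of the total population. The only method I currently know is to use cut() repeatedly, choosing successively lower values of h until I'm happy with the spread of the cuts. However, this forces me to go back and re-cluster the groups I pruned off in order to aggregate them into 100-member groups, which can be very time consuming.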

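I've experimented with the dynamicTreeCut package, but I can't figure out how to enter these (relatively simple) constraints. I'm using deepSplit as the way to specify the number of groupings, but according to the documentation its maximum value is 4. For the exercise below, all I want to do is split the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size constraint on my own, but if you want to tackle that too, it would be helpful!).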

Here's my example, using the Orange dataset.

library(dynamicTreeCut)
library(reshape2)

##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)


####casting the data to make a correlation matrix, and then running 
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))

###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)

dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

As you can see, it only assigns two clusters. For my exercise, I want to avoid cutting the tree with an explicit height term, because I want k trees.

Ham*_*adr 7

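1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of the clustering. See ?dist and ?hclust for more details.

2- Plot the dendrogram before starting the cutting step. See ?hclust for details.

3- Use the hybrid adaptive tree cut method in the dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap, or maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).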


For your example,

1- Change the "euclidean" and/or "complete" methods as appropriate,

orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")

2- Plot the dendrogram,

plot(orangeClust)

3- Use the hybrid tree cut method and tune the shape parameters,

dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

As a guide for tuning the shape parameters, the default values are

deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4

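As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.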
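As a quick worked check of the defaults listed above, here is a small base-R sketch that rebuilds the table, so you can see how minGap shrinks as maxCoreScatter grows. It does not call dynamicTreeCut at all; the data frame name `defaults` is just my own choice for illustration.

```r
# Default maxCoreScatter for deepSplit = 0..4, copied from the list above
maxCoreScatter <- c(0.64, 0.73, 0.82, 0.91, 0.95)

# Matching default minGap, using the same formula as in the list
minGap <- (1 - maxCoreScatter) * 3/4

# Illustrative table of the defaults (`defaults` is a name chosen here)
defaults <- data.frame(deepSplit = 0:4, maxCoreScatter, minGap)
print(defaults)
```

So the default minGap runs from 0.27 at deepSplit=0 down to 0.0375 at deepSplit=4, and the maxCoreScatter = 0.99 used in the next example gives a minGap of 0.0075.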

For example, to get smaller clusters

maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0

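Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the resulting clusters should agree with the data. This leads you to the important step of cluster validation and interpretation.


Good luck!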
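If what you need is a fixed number of clusters with a minimum size and no explicit height, one possible alternative to tuning the shape parameters is to scan cutree() over increasing k and stop as soon as enough branches are large enough. The sketch below is only that, a sketch: `cut_k_min_size` is a hypothetical helper of my own (it is not part of dynamicTreeCut or base R), it borrows the convention of labelling unassigned members 0, and it uses the built-in USArrests data rather than your Orange example so that it runs with base R alone.

```r
# Hypothetical helper (not part of dynamicTreeCut): cut `hc` into the smallest
# number of branches that yields at least `k` clusters with `min_size` or more
# members. Members of smaller branches are labelled 0 (unassigned).
cut_k_min_size <- function(hc, k, min_size) {
  n <- length(hc$order)
  for (n_cut in k:n) {
    labels <- cutree(hc, k = n_cut)
    sizes <- table(labels)
    # Labels of the branches that satisfy the minimum size
    big <- as.integer(names(sizes)[as.vector(sizes) >= min_size])
    if (length(big) >= k) {
      # Renumber the large clusters consecutively and set everything else to 0
      return(ifelse(labels %in% big, match(labels, big), 0L))
    }
  }
  stop("no cut gives ", k, " clusters of size >= ", min_size)
}

# Illustration on a built-in data set (assumed stand-in for your own data)
hc <- hclust(dist(scale(USArrests)), method = "complete")
cl <- cut_k_min_size(hc, k = 3, min_size = 5)
table(cl)
```

table(cl) then shows the size of each cluster, so a maximum-size constraint can be checked the same way. Whether dropping small branches into group 0 is acceptable depends on your data, so treat this as a starting point rather than a replacement for the validation step.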