dis*_*lus 0 grouping r correlation
I have a data frame, and I want to find which groups of variables share the highest correlations. For example:
mydata <- structure(list(V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,20L, 10L, 10L, 10L, 10L, 10L),
V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA,-16L))
I can compute the correlation coefficients and find every pair whose correlation is above a threshold:
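# pairwise Pearson correlations between all columns
var.corelation <- cor(as.matrix(mydata), method="pearson")
# reshape the matrix into long format: one row per (Var1, Var2, Freq) triple
fin.corr = as.data.frame( as.table( var.corelation ) )
# label of each unordered pair, e.g. "V1_V2", used below to keep every pair only once
combinations_1 = combn( colnames( var.corelation ) , 2 , FUN = function( x ) paste( x , collapse = "_" ) )
# drop the self-correlations on the diagonal
fin.corr = fin.corr[ fin.corr$Var1 != fin.corr$Var2 , ]
fin.corr = fin.corr [order(fin.corr$Freq, decreasing = TRUE) , ,drop = FALSE]
# keep one row per unordered pair
fin.corr = fin.corr[ paste( fin.corr$Var1 , fin.corr$Var2 , sep = "_" ) %in% combinations_1 , ]
# keep only the pairs above the threshold, then sort by variable name
fin.corr <- fin.corr[fin.corr$Freq > 0.62, ]
fin.corr <- fin.corr[order(fin.corr$Var1, fin.corr$Var2), ]
fin.corr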
The output so far is:
Var1 Var2 Freq
V1 V2 0.9999978
V3 V4 0.6212136
V3 V5 0.6220380
V4 V5 0.9992690
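Here V1 and V2 form one group, while V3, V4 and V5 form another group in which every pair of variables has a correlation above the threshold. I would like to get these two groups of variables as a list. For example: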
list(c("V1", "V2"), c("V3", "V4", "V5"))
I got an answer using graph theory and the igraph package.
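var.corelation <- cor(as.matrix(mydata), method="pearson")
library(igraph)
# prevent duplicated pairs: keep the lower triangle only (upper triangle and diagonal become 0)
var.corelation <- var.corelation*lower.tri(var.corelation)
# row/column indices of the pairs above the threshold
check.corelation <- which(var.corelation>0.62, arr.ind=TRUE)
# undirected graph whose edges are the highly correlated pairs
# (graph_from_data_frame() and components() are the current names of graph.data.frame() and clusters())
graph.cor <- graph_from_data_frame(check.corelation, directed = FALSE)
# vertex names are the column indices; split them by connected component
groups.cor <- split(as.integer(V(graph.cor)$name), components(graph.cor)$membership)
# map the indices back to variable names, sorted into column order
lapply(groups.cor,FUN=function(list.cor){rownames(var.corelation)[sort(list.cor)]})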
This returns:
$`1`
[1] "V1" "V2"
$`2`
[1] "V3" "V4" "V5"
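For readers who want to see what the graph step amounts to, here is a minimal sketch of the same grouping in base R, without igraph. It treats every pair above the threshold as an edge and labels the connected components with a breadth-first search. The helper name `find_components` and the variable names `threshold`, `cor_mat` and `adj` are made up for this illustration and do not come from the answer above.

```r
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)

threshold <- 0.62                 # same cut-off as in the question
cor_mat <- cor(mydata, method = "pearson")

# Adjacency matrix: TRUE where two different variables correlate above the cut-off
adj <- cor_mat > threshold
diag(adj) <- FALSE

# Hypothetical helper: label connected components with a breadth-first search
find_components <- function(adj) {
  n <- nrow(adj)
  label <- integer(n)             # 0 means "not visited yet"
  current <- 0L
  for (start in seq_len(n)) {
    if (label[start] != 0L) next  # already assigned to a component
    current <- current + 1L
    label[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      # unvisited neighbours of the current node join the same component
      neighbours <- which(adj[node, ] & label == 0L)
      label[neighbours] <- current
      queue <- c(queue, neighbours)
    }
  }
  label
}

membership <- find_components(adj)
groups <- unname(split(colnames(cor_mat), membership))
groups
```

One difference from the igraph version: a variable with no partner above the threshold would show up here as a group of its own, whereas it never enters the edge list in the answer above. Filtering with `groups[lengths(groups) > 1]` would drop such singletons.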
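I would also check out my comment, because to me a correlation might be below the (arbitrary) cut-off point and yet really be related to the cluster, and so would give me better insight.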
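The comment referred to is not reproduced here, so the following is only one possible way to look at the grouping without committing to a single cut-off, not necessarily what the commenter had in mind. It is a sketch that assumes `1 - |r|` is an acceptable dissimilarity between variables and uses base R's `hclust()` with average linkage; both choices, and the names `dissim`, `tree` and `groups_k2`, are assumptions made for illustration.

```r
mydata <- data.frame(
  V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
  V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
  V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L, 20L, 10L, 10L, 10L, 10L, 10L),
  V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
  V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)
)

cor_mat <- cor(mydata, method = "pearson")

# Assumption: strongly correlated variables (positive or negative) count as "close"
dissim <- as.dist(1 - abs(cor_mat))

# Agglomerative clustering of the variables; average linkage is an arbitrary choice
tree <- hclust(dissim, method = "average")

# plot(tree) draws the dendrogram, which shows how groups merge at every level
# instead of at one fixed threshold

# Cut the tree into a chosen number of groups (k = 2 picked only for this example)
groups_k2 <- cutree(tree, k = 2)
split(names(groups_k2), groups_k2)
```

Inspecting the dendrogram shows at what dissimilarity each variable joins a group, so a variable that sits just below a fixed threshold is still visibly attached to its nearest cluster.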