dis*_*lus 0 grouping r correlation
I have a data frame, and I want to find out which groups of variables share the highest correlation. For example:
mydata <- structure(list(V1 = c(1L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 43L),
V2 = c(2L, 2L, 5L, 4L, 366L, 65L, 43L, 456L, 876L, 78L, 687L, 378L, 378L, 34L, 53L, 41L),
V3 = c(10L, 20L, 10L, 20L, 10L, 20L, 1L, 0L, 1L, 2010L,20L, 10L, 10L, 10L, 10L, 10L),
V4 = c(2L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 1L),
V5 = c(4L, 10L, 31L, 2L, 2L, 5L, 2L, 5L, 1L, 52L, 1L, 2L, 52L, 6L, 2L, 3L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA,-16L))
I can compute the correlation coefficients and find every pair of variables whose correlation is above a threshold:
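# Pearson correlation matrix of all variables
var.corelation <- cor(as.matrix(mydata), method="pearson")
# Long format: one row per (Var1, Var2) pair, with the correlation in Freq
fin.corr = as.data.frame( as.table( var.corelation ) )
# Labels such as "V1_V2", one per unordered pair, used below to drop mirrored duplicates
combinations_1 = combn( colnames( var.corelation ) , 2 , FUN = function( x ) paste( x , collapse = "_" ) )
# Drop the self-correlations on the diagonal
fin.corr = fin.corr[ fin.corr$Var1 != fin.corr$Var2 , ]
fin.corr = fin.corr [order(fin.corr$Freq, decreasing = TRUE) , ,drop = FALSE]
# Keep each pair only once
fin.corr = fin.corr[ paste( fin.corr$Var1 , fin.corr$Var2 , sep = "_" ) %in% combinations_1 , ]
# Keep pairs above the threshold, then sort by variable name
fin.corr <- fin.corr[fin.corr$Freq > 0.62, ]
fin.corr <- fin.corr[order(fin.corr$Var1, fin.corr$Var2), ]
fin.corr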
The output so far is:
Var1 Var2 Freq
V1 V2 0.9999978
V3 V4 0.6212136
V3 V5 0.6220380
V4 V5 0.9992690
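The same pair table can probably be produced more directly by masking the correlation matrix with `upper.tri()`. This is only a sketch of that idea, not the asker's code: the names `cor.mat`, `threshold`, `idx` and `pairs` are made up for illustration, and the data frame is repeated so the snippet runs on its own.

```r
# Same data as in the question, repeated so the snippet is self-contained
mydata <- data.frame(
  V1 = c(1, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 43),
  V2 = c(2, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 41),
  V3 = c(10, 20, 10, 20, 10, 20, 1, 0, 1, 2010, 20, 10, 10, 10, 10, 10),
  V4 = c(2, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 1),
  V5 = c(4, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 3)
)

cor.mat   <- cor(mydata, method = "pearson")
threshold <- 0.62  # the same arbitrary cutoff used in the question

# upper.tri() keeps each unordered pair once and excludes the diagonal
idx <- which(upper.tri(cor.mat) & cor.mat > threshold, arr.ind = TRUE)

# Translate the row/column indices back into variable names
pairs <- data.frame(
  Var1 = rownames(cor.mat)[idx[, "row"]],
  Var2 = colnames(cor.mat)[idx[, "col"]],
  Freq = cor.mat[idx]
)
pairs
```

Because the mask already removes mirrored pairs and self-correlations, the `combn()` and `paste()` steps are not needed in this variant.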
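Here V1 and V2 form one group, while V3, V4 and V5 form another group in which every pair of variables has a correlation above the threshold. I would like to get these two groups of variables as a list. For example: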
list(c("V1", "V2"), c("V3", "V4", "V5"))
I got an answer using graph theory and the igraph package.
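var.corelation <- cor(as.matrix(mydata), method="pearson")
library(igraph)
# prevent duplicated pairs: zero out the upper triangle and the diagonal
var.corelation <- var.corelation*lower.tri(var.corelation)
# row/column indices of every pair above the threshold
check.corelation <- which(var.corelation>0.62, arr.ind=TRUE)
# undirected graph with one edge per correlated pair
graph.cor <- graph.data.frame(check.corelation, directed = FALSE)
# split the variable indices by connected component
groups.cor <- split(unique(as.vector(check.corelation)), clusters(graph.cor)$membership)
# map the indices back to variable names
lapply(groups.cor,FUN=function(list.cor){rownames(var.corelation)[list.cor]})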
This returns:
$`1`
[1] "V1" "V2"
$`2`
[1] "V3" "V4" "V5"
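Two caveats about the code above. First, `graph.data.frame()` and `clusters()` are older names that newer igraph releases deprecate in favor of `graph_from_data_frame()` and `components()`. Second, connected components only guarantee that the variables in a group are linked through a chain of above-threshold pairs, not that every pair in the group is above the threshold. If the stricter condition is wanted, maximal cliques are one option. The sketch below shows both under those assumptions; the names `edge.df`, `g`, `groups` and `cliques` are illustrative only.

```r
library(igraph)

# Same data as in the question, repeated so the snippet is self-contained
mydata <- data.frame(
  V1 = c(1, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 43),
  V2 = c(2, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 41),
  V3 = c(10, 20, 10, 20, 10, 20, 1, 0, 1, 2010, 20, 10, 10, 10, 10, 10),
  V4 = c(2, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 1),
  V5 = c(4, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 3)
)

cor.mat <- cor(mydata, method = "pearson")
# Zero the upper triangle and diagonal so each pair is listed once
cor.mat[upper.tri(cor.mat, diag = TRUE)] <- 0
edges <- which(cor.mat > 0.62, arr.ind = TRUE)

# Edge list that uses variable names directly instead of indices
edge.df <- data.frame(
  from = rownames(cor.mat)[edges[, "row"]],
  to   = colnames(cor.mat)[edges[, "col"]]
)
g <- graph_from_data_frame(edge.df, directed = FALSE)

# Groups as connected components (what the answer computes)
comp   <- components(g)
groups <- lapply(split(V(g)$name, comp$membership), sort)

# Stricter alternative: maximal cliques, where every pair inside a
# group is itself above the threshold
cliques <- lapply(max_cliques(g, min = 2), function(vs) sort(as_ids(vs)))

groups
cliques
```

For the example data both approaches should agree, since all three pairs among V3, V4 and V5 exceed the cutoff. They can differ on other data, for instance when A correlates with B and B with C, but A and C do not.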
I would also take a look at my comment, because in my view a variable's correlation can fall below the (arbitrary) cutoff and still genuinely belong to a cluster, which could give you better insight.
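One possible way to act on that idea without a hard cutoff is hierarchical clustering on a correlation-based distance. This is not something proposed in the thread, only a sketch: the distance `1 - abs(cor)`, the average linkage and the choice `k = 2` are all assumptions made for illustration, and the names `d`, `hc` and `membership` are invented.

```r
# Same data as in the question, repeated so the snippet is self-contained
mydata <- data.frame(
  V1 = c(1, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 43),
  V2 = c(2, 2, 5, 4, 366, 65, 43, 456, 876, 78, 687, 378, 378, 34, 53, 41),
  V3 = c(10, 20, 10, 20, 10, 20, 1, 0, 1, 2010, 20, 10, 10, 10, 10, 10),
  V4 = c(2, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 1),
  V5 = c(4, 10, 31, 2, 2, 5, 2, 5, 1, 52, 1, 2, 52, 6, 2, 3)
)

cor.mat <- cor(mydata, method = "pearson")

# Turn correlation into a dissimilarity: strongly correlated
# variables (positive or negative) end up close together
d  <- as.dist(1 - abs(cor.mat))
hc <- hclust(d, method = "average")  # linkage choice is an assumption

# Cut the tree into k groups; k = 2 is assumed here, not derived
membership <- cutree(hc, k = 2)
split(names(membership), membership)
```

Plotting the tree with `plot(hc)` shows at which dissimilarity each variable joins a group, which may be more informative than a single fixed threshold.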