I have data of the following type (but with many more variables and individuals):
mydf <- data.frame(Inv = 1:6, varA = c(1, 1, 1, 0, 1, 1),
  varB = c(1, 0, 1, 0, 1, 1), varC = c(1, 0, 0, 0, 1, 1), varD = c(1, 1, 1, 0, 1, 1),
  varE = c(1, 0, 1, 0, 1, 1), varF = c(1, 1, 1, 0, 1, 1))
mydf
  Inv varA varB varC varD varE varF
1   1    1    1    1    1    1    1
2   2    1    0    0    1    0    1
3   3    1    1    0    1    1    1
4   4    0    0    0    0    0    0
5   5    1    1    1    1    1    1
6   6    1    1    1    1    1    1
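I want to do pairwise comparisons (of both the variables and the individuals/subjects). If any are duplicates, I want to keep only one and save the names of the duplicated individuals/variables to a separate file as a log.
For example, in the data above:
For the variables: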
varA is exactly the same as varD and varF, so I will keep only varA in the new data
mydf$varA == mydf$varD
[1] TRUE TRUE TRUE TRUE TRUE TRUE
varB and varE have exactly the same data, so I will keep only varB
varC is unique
For Inv (i.e., the subjects):
1, 5 and 6 are the same, so keep only 1
So the resulting output would be:
mydf <- data.frame(Inv = 1:4, varA = c(1, 1, 1, 0),
  varB = c(1, 0, 1, 0), varC = c(1, 0, 0, 0))
  Inv varA varB varC
1   1    1    1    1
2   2    1    0    0
3   3    1    1    0
4   4    0    0    0
I can find the duplicated variables with a correlation matrix:
cor(mydf[,-1])
          varA      varB      varC      varD      varE      varF
varA 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
varB 0.6324555 1.0000000 0.7071068 0.6324555 1.0000000 0.6324555
varC 0.4472136 0.7071068 1.0000000 0.4472136 0.7071068 0.4472136
varD 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
varE 0.6324555 1.0000000 0.7071068 0.6324555 1.0000000 0.6324555
varF 1.0000000 0.6324555 0.4472136 1.0000000 0.6324555 1.0000000
Can we automate this process?
You can also use findCorrelation from the caret package:
findCorrelation(x, cutoff = .90, verbose = FALSE)
Here x is a correlation matrix, and the output is a vector of indices indicating which columns to remove.
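As a rough sketch of how that call might be applied to the question's data: the correlation matrix is computed first and passed in place of x, and the returned indices are then dropped from the variable columns. The names corMat, dropIdx and reduced are illustrative only, and the call is guarded so the snippet still runs when caret is not installed. Note that a cutoff of 0.90 flags any pair of columns correlated above 0.90, not only exact duplicates; in this data set the largest off-diagonal correlation below 1 is about 0.71, so only the exact copies are affected.

```r
# Example data from the question.
mydf <- data.frame(Inv = 1:6,
                   varA = c(1, 1, 1, 0, 1, 1), varB = c(1, 0, 1, 0, 1, 1),
                   varC = c(1, 0, 0, 0, 1, 1), varD = c(1, 1, 1, 0, 1, 1),
                   varE = c(1, 0, 1, 0, 1, 1), varF = c(1, 1, 1, 0, 1, 1))
dat <- mydf[-1]      # drop the Inv ID column
corMat <- cor(dat)   # findCorrelation() expects a correlation matrix

if (requireNamespace("caret", quietly = TRUE)) {
  # Indices (into the columns of dat) that caret suggests removing.
  dropIdx <- caret::findCorrelation(corMat, cutoff = 0.90, verbose = FALSE)
  # Guard against an empty result: dat[, -integer(0)] would select no columns.
  reduced <- if (length(dropIdx) > 0) dat[, -dropIdx, drop = FALSE] else dat
  print(names(reduced))
}
```

Which member of each correlated group is retained is decided by findCorrelation itself, so the surviving column is not necessarily the first one in the data frame.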
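This should do the trick:
dat <- mydf[-1]  # drop the Inv column
# TRUE where the absolute correlation is 1, up to floating-point tolerance
cMat <- abs(cor(dat)) >= (1 - .Machine$double.eps^0.5)
# keep only columns that do not match any earlier column
whichKeep <- which(rowSums(lower.tri(cMat) * cMat) == 0)
cbind(mydf[1], mydf[whichKeep + 1])
  Inv varA varB varC
1   1    1    1    1
2   2    1    0    0
3   3    1    1    0
4   4    0    0    0
5   5    1    1    1
6   6    1    1    1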
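The code above removes duplicated variables only; rows 5 and 6 are still present, and nothing is logged. The following is a minimal sketch of one way to cover the rest of the question with base R, under a different technique: exact comparison via duplicated() rather than correlation. The names colKeys, rowKeys, dupLog and logFile are illustrative, and the log path is a hypothetical placeholder. It assumes the values contain no "|" character and that pasting them to text distinguishes them, which holds for the 0/1 data here but may not for arbitrary doubles.

```r
# Example data from the question.
mydf <- data.frame(Inv = 1:6,
                   varA = c(1, 1, 1, 0, 1, 1), varB = c(1, 0, 1, 0, 1, 1),
                   varC = c(1, 0, 0, 0, 1, 1), varD = c(1, 1, 1, 0, 1, 1),
                   varE = c(1, 0, 1, 0, 1, 1), varF = c(1, 1, 1, 0, 1, 1))
dat <- mydf[-1]  # compare on the variables only, not on the Inv ID

# One string key per column and per row; equal keys mean equal contents.
colKeys <- vapply(dat, paste, character(1), collapse = "|")
rowKeys <- do.call(paste, c(dat, sep = "|"))

# duplicated() flags every item that repeats an earlier one.
dupCols <- duplicated(colKeys)
dupRows <- duplicated(rowKeys)

# Keep Inv plus the first copy of each variable, and the first copy of each row.
result <- mydf[!dupRows, c(TRUE, !dupCols)]

# match(keys, keys) gives the position of the first item with the same key,
# i.e. the item that was kept in place of each dropped one.
dupLog <- rbind(
  data.frame(type = rep("variable", sum(dupCols)),
             dropped = names(dat)[dupCols],
             kept = names(dat)[match(colKeys, colKeys)[dupCols]],
             stringsAsFactors = FALSE),
  data.frame(type = rep("individual", sum(dupRows)),
             dropped = as.character(mydf$Inv[dupRows]),
             kept = as.character(mydf$Inv[match(rowKeys, rowKeys)[dupRows]]),
             stringsAsFactors = FALSE)
)

# Hypothetical log location; replace with a real path as needed.
logFile <- file.path(tempdir(), "duplicates_log.csv")
write.csv(dupLog, logFile, row.names = FALSE)
```

On the example data, result holds Inv 1 to 4 with varA, varB and varC, and dupLog records varD, varE and varF as dropped in favour of varA, varB and varA, and individuals 5 and 6 as dropped in favour of 1. Unlike the correlation test, this treats two columns as duplicates only when every value matches, so a column that is merely a rescaled or negated copy of another is kept.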