I have a data.frame:
SelectVar
a b c d e f g h i j k l ll m n o p q r
1 Dxa8 Dxa8 0 Dxa8 Dxa8 0 Dxa8 Dxa8 0 0 0 0 0 0 0 0 0 Dxc8 0
2 Dxb8 Dxc8 0 Dxe8 Dxi8 0 tneg tpos 0 0 0 0 0 0 0 0 0 Dxi8 0
I want to remove the columns that are zero in both rows, so that the result is a data frame like this:
SelectVar
a b d e g h q
1 Dxa8 Dxa8 Dxa8 Dxa8 Dxa8 Dxa8 Dxc8
2 Dxb8 Dxc8 Dxe8 Dxi8 tneg tpos Dxi8
I tried:
SelectVar!=0
which produces a TRUE/FALSE data frame, and:
SelectVar[, colSums(abs(SelectVar)) ! == 0]
which produces an error.
How can I remove the columns whose value is zero in every row?
Mat*_*erg 39
You almost have it. Put the two together:
SelectVar[, colSums(SelectVar != 0) > 0]
This works because the factor columns are evaluated as numerics that are >= 1.
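As a minimal, self-contained sketch of the expression above, the snippet below rebuilds a few of the question's columns and applies the same filter. The choice of columns, the use of character columns (`stringsAsFactors = FALSE`), the names `keep` and `result`, and the added `drop = FALSE` are assumptions for illustration, not part of the original answer.

```r
# Subset of the question's data, stored as character columns (assumption).
SelectVar <- data.frame(
  a = c("Dxa8", "Dxb8"),
  b = c("Dxa8", "Dxc8"),
  c = c("0", "0"),
  d = c("Dxa8", "Dxe8"),
  q = c("Dxc8", "Dxi8"),
  r = c("0", "0"),
  stringsAsFactors = FALSE
)

# SelectVar != 0 is a logical matrix; colSums() counts the non-zero cells per column.
keep <- colSums(SelectVar != 0) > 0

# drop = FALSE keeps the result a data frame even if only one column survives.
result <- SelectVar[, keep, drop = FALSE]
print(names(result))
```

Columns `c` and `r` contain only zeros, so they are the ones removed here.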
tmf*_*mnk 10
One option with dplyr 1.0.0 could be:
df %>%
select(where(~ any(. != 0)))
a b d e g h q
1 Dxa8 Dxa8 Dxa8 Dxa8 Dxa8 Dxa8 Dxc8
2 Dxb8 Dxc8 Dxe8 Dxi8 tneg tpos Dxi8
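For readers without dplyr, the same per-column test, `any(. != 0)`, can be sketched in base R with `Filter()`. This is a swapped-in base R counterpart rather than the answer's own method; the sample data and the helper name `keep_if_any_nonzero` are hypothetical.

```r
# Small mixed-type example (assumption): character and numeric columns.
df <- data.frame(
  a = c("Dxa8", "Dxb8"),
  c = c("0", "0"),
  g = c("Dxa8", "tneg"),
  n = c(0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a column holds at least one non-zero value.
keep_if_any_nonzero <- function(column) any(column != 0)

# Filter() applies the predicate to each column and keeps the matching ones.
filtered <- Filter(keep_if_any_nonzero, df)
print(names(filtered))
```

Because the predicate runs column by column, it handles character and numeric columns in the same data frame.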
An option that is roughly 40% faster, based on mean execution time, is:
df[,-(which(colSums(df)==0))]
We can benchmark the two options on a simple example data frame with 3,000 columns and two observations.
# Create simple 2 X 3000 data frame with many 1s and 0s
# 500 columns have all 0s
df = matrix(c(rep(c(0,1,1),1000),rep(c(1,0,0),1000)),nrow=2)
df = as.data.frame(df)
# Benchmark the two options in milliseconds, 100 times
library(microbenchmark)
microbenchmark(
df[,colSums(df != 0) > 0],
df[,-(which(colSums(df)==0))]
)
Unit: milliseconds
expr min lq mean median uq max neval
df[, colSums(df != 0) > 0] 23.3844 24.77905 30.24852 26.37730 29.17175 140.6486 100
df[, -(which(colSums(df) == 0))] 17.3664 19.12815 21.58901 20.59055 22.29905 41.9485 100
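Two caveats about the negative-index form `df[,-(which(colSums(df)==0))]` are worth checking before adopting it for speed. The sketch below uses hypothetical data frames (`df_nonzero`, `cancel`) to show them: when no column sums to zero, `which()` returns an empty vector and every column is dropped, and a column whose values cancel out (such as -1 and 1) is treated as all zeros.

```r
# Case 1: no all-zero columns, so which() returns integer(0).
df_nonzero <- data.frame(x = c(1, 2), y = c(3, 4))
dropped_everything <- df_nonzero[, -(which(colSums(df_nonzero) == 0))]
print(ncol(dropped_everything))  # all columns are lost

# The logical-index form keeps both columns in the same situation.
kept_all <- df_nonzero[, colSums(df_nonzero != 0) > 0, drop = FALSE]
print(names(kept_all))

# Case 2: values that cancel out sum to zero although they are not zero.
cancel <- data.frame(z = c(-1, 1), w = c(0, 0))
by_sum <- cancel[, -(which(colSums(cancel) == 0)), drop = FALSE]
by_count <- cancel[, colSums(cancel != 0) > 0, drop = FALSE]
print(names(by_sum))    # z is removed along with w
print(names(by_count))  # only w is removed
```

The sum-based form also needs purely numeric columns, so it would not apply directly to the character data in the question.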