Removing constant columns in R

Question

I was using the prcomp function when I got this error:

Error in prcomp.default(x, ...) : 
cannot rescale a constant/zero column to unit variance

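I know I can scan my data manually, but is there a function or command in R that can help me remove these constant variables? I know this is a very simple task, but I have never come across a function that does it.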
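
A minimal sketch that reproduces the error, assuming a toy data frame with one constant column (the data and the names `df`, `x`, `y` and `msg` are made up for illustration):

```r
# Toy data: y is constant, so its variance is zero
df <- data.frame(x = c(1, 2, 3, 4, 5), y = rep(1, 5))

# scale. = TRUE asks prcomp to rescale every column to unit variance,
# which is impossible for the constant column y
msg <- tryCatch(
  {
    prcomp(df, scale. = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The error only appears when scaling is requested, since that is the step that divides each column by its standard deviation.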

Thanks,

Answer 1

jub*_*uba 38

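The problem here is that the variance of one of your columns is zero. You can check which columns of a data frame are constant like this, for example: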

df <- data.frame(x=1:5, y=rep(1,5))
df
#   x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Names of the columns that have zero variance.
# Index names(df) directly: df[, <one column>] drops to a plain vector, which has no names.
names(df)[sapply(df, function(v) var(v, na.rm=TRUE)==0)]
# [1] "y"

So, if you want to exclude these columns, you can use:

# drop=FALSE keeps the result a data frame even when only one column remains
df[, sapply(df, function(v) var(v, na.rm=TRUE)!=0), drop=FALSE]

Edit: in fact, it is simpler to use apply. Something like this:

df[,apply(df, 2, var, na.rm=TRUE) != 0]
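
The steps above can be wrapped in a small helper. This is only a sketch: `drop_constant_cols` is a hypothetical name, not a base R function, and it assumes that a column containing only NAs should also count as constant. It compares distinct values instead of computing the variance, so it also works on character or factor columns.

```r
# Hypothetical helper (not part of base R): remove constant columns from a data frame
drop_constant_cols <- function(d) {
  is_const <- vapply(d, function(v) {
    # A column is constant if it has at most one distinct non-NA value
    length(unique(v[!is.na(v)])) <= 1
  }, logical(1))
  # drop = FALSE keeps a data frame even if a single column is left
  d[, !is_const, drop = FALSE]
}

df <- data.frame(x = 1:5, y = rep(1, 5), z = c(2, 4, 1, 3, 5))
clean <- drop_constant_cols(df)
names(clean)

# prcomp with scaling now runs, because no zero-variance column is left
pca <- prcomp(clean, scale. = TRUE)
```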

Answer 2

ray*_*how 13

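I guess this Q&A is a popular Google search result, but the answers are a bit slow for large matrices, and I don't have enough reputation to comment on the first answer. So I am posting a new answer to the question.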

For each column of a large matrix, it is enough to check whether the maximum is equal to the minimum.

df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]

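Here is the test. Compared with the first answer, the running time is reduced by more than 90%. It is also faster than the answer given in the second comment on the question.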

ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method

time1
#   user  system elapsed 
# 22.267   0.194  22.626 
time2
#   user  system elapsed 
#  2.073   0.077   2.155 
time3
#   user  system elapsed 
#  6.702   0.060   6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE

I re-ran the test and found that using all(x == x[1], na.rm = TRUE) is 15% faster than computing max and min. (2 upvotes)
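
A minimal sketch of the variant suggested in the comment above, on a small made-up matrix (the names `m`, `is_const` and `m_clean` are illustrative only):

```r
# Small made-up matrix: columns 2 and 4 are constant
m <- cbind(c(3, 1, 2), c(7, 7, 7), c(1, 2, 3), c(0, 0, 0))

# TRUE for every column whose values all equal the column's first value
is_const <- apply(m, 2, function(x) all(x == x[1], na.rm = TRUE))

# Keep only the non-constant columns; drop = FALSE preserves the matrix shape
m_clean <- m[, !is_const, drop = FALSE]
dim(m_clean)
```

One caveat: if the first element of a column is NA, every comparison is NA and all(..., na.rm = TRUE) returns TRUE, so the column is reported as constant whatever it contains. Comparing against the first non-NA value avoids this.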

Answer 3

Emm*_*Lin 6

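This Q&A is a popular Google search result, but the answers are a bit slow for large matrices, and @raymkchow's version is slow when there are NAs. So I propose a new version that uses exponential search and the power of data.table.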

I implemented this function in the dataPreparation package.
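
The idea behind an exponential search can be sketched roughly as follows. This is an illustration under my own assumptions, not the dataPreparation source code: `constant_cols_exp` and its `start` argument are hypothetical names, and all-NA columns are assumed to count as constant. The function looks at the first few rows only, discards every column that already shows two distinct values, then doubles the number of rows inspected for the remaining candidates.

```r
# Hypothetical sketch of an exponential search for constant columns
# (NOT the dataPreparation implementation)
constant_cols_exp <- function(d, start = 10L) {
  n <- nrow(d)
  candidates <- seq_along(d)   # indices of columns that might still be constant
  k <- min(start, n)           # number of leading rows inspected in this round
  repeat {
    # Keep only the candidates that still look constant on the first k rows
    still_const <- vapply(candidates, function(j) {
      v <- d[[j]][seq_len(k)]
      length(unique(v[!is.na(v)])) <= 1
    }, logical(1))
    candidates <- candidates[still_const]
    if (k >= n || length(candidates) == 0) break
    k <- min(2L * k, n)        # double the window, capped at the row count
  }
  candidates
}

set.seed(1)
d <- data.frame(a = rnorm(1000),
                b = rep(5, 1000),
                c = c(rep(1, 999), 2))  # constant except for the last row
const_idx <- constant_cols_exp(d)
const_idx
```

Most non-constant columns are eliminated after the first handful of rows, so only truly constant (or nearly constant) columns are ever scanned in full.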

First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:

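library(data.table)       # needed for as.data.table
library(dataPreparation)  # the package this answer says provides whichAreConstant

ncol = 1000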
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)

Then benchmark all the methods:

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-whichAreConstant(df, verbose=FALSE)]) # My method

The results are as follows:

time1 # Variance approach
#   user  system elapsed 
#   2.55    1.45    4.07
time2 # Min = max approach
#   user  system elapsed 
#  2.72      1.5    4.22
time3 # length(unique()) approach
#   user  system elapsed 
#    6.7    2.75    9.53
time4 # Exponential search approach
#   user  system elapsed 
#   0.39    0.07    0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE

dataPreparation::whichAreConstant is about 10 times faster than the other methods.

In addition, the more rows you have, the more worthwhile it is to use.
