Count the number of "0"s in this factor

Rem*_*i.b 5 string parsing r pattern-matching

Consider the following factor:

x = factor(c("1|1","1|0","1|1","1|1","0|0","1|1","0|1"))

I want to count the occurrences of the character "0" in this factor. The only solution I have found so far is:

sum(grepl("0",strsplit(paste(sapply(x, as.character), collapse=""), split="")[[1]]))
# [1] 4

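This solution seems very complicated for such a simple task. Is there a "better" alternative? (Since the process will be repeated about 100,000 times on factors 2000 elements long, I may end up caring about performance as well.)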

Sat*_*ish 7

x = factor(c("1|1","1|0","1|1","1|1","0|0","1|1","0|1"))
x
# [1] 1|1 1|0 1|1 1|1 0|0 1|1 0|1
# Levels: 0|0 0|1 1|0 1|1

sum( unlist( lapply( strsplit(as.character(x), "|"), function( x ) length(grep( '0', x ))) ) )
# [1] 4

Or:

sum(nchar(gsub("[1 |]", '', x )))
# [1] 4

Based on @Rich Scriven's comment:

sum(nchar(gsub("[^0]", '', x )))
# [1] 4

Following @thelatemail's comment: using tabulate is faster than the solutions above. Here is the comparison.

sum(nchar(gsub("[^0]", "", levels(x) )) * tabulate(x))
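
This works because a factor stores each distinct string once in `levels(x)` and keeps the data as integer codes, so the regex only has to run on the four levels, and `tabulate()` supplies how often each level occurs. Below is a minimal sketch of the intermediate values. Passing `nbins = nlevels(x)` is an addition that is not in the one-liner above; it keeps the two vectors the same length when trailing levels happen to be unused, since otherwise the shorter count vector would be silently recycled.

```r
x <- factor(c("1|1","1|0","1|1","1|1","0|0","1|1","0|1"))

# Zeros per distinct level; levels are "0|0" "0|1" "1|0" "1|1"
zeros_per_level <- nchar(gsub("[^0]", "", levels(x)))
zeros_per_level
# [1] 2 1 1 0

# Occurrences of each level, in level order
# (nbins is an added safeguard, not part of the original answer)
level_counts <- tabulate(x, nbins = nlevels(x))
level_counts
# [1] 1 1 1 4

# Weighted sum gives the total number of zeros
sum(zeros_per_level * level_counts)
# [1] 4
```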

Timings:

library(stringr)  # provides str_count() and fixed() used in the last timing
x2 <- sample(x,1e7,replace=TRUE)
system.time(sum(nchar(gsub("[^0]", '', x2 ))));
# user  system elapsed 
# 14.24    0.22   14.65 
system.time(sum(nchar(gsub("[^0]", "", levels(x2) )) * tabulate(x2)));
# user  system elapsed 
# 0.04    0.00    0.04 
system.time(sum(str_count(x2, fixed("0"))))
# user  system elapsed 
# 1.02    0.13    1.25
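
For the workload described in the question (factors about 2000 elements long, processed many times), the two base-R variants can be wrapped as functions and checked against each other before timing them. This is only a sketch under assumptions: the names `count_zeros_full` and `count_zeros_levels` are hypothetical, the random test factor is invented for illustration, and the repetition count of 1000 is arbitrary.

```r
# Hypothetical helper: run the regex over every element
count_zeros_full <- function(f) sum(nchar(gsub("[^0]", "", f)))

# Hypothetical helper: run the regex over the levels only,
# then weight each level by how often it occurs
count_zeros_levels <- function(f) {
  sum(nchar(gsub("[^0]", "", levels(f))) * tabulate(f, nbins = nlevels(f)))
}

set.seed(1)  # arbitrary seed, for reproducibility only
f <- factor(sample(c("0|0", "0|1", "1|0", "1|1"), 2000, replace = TRUE))

# Both approaches should agree on the same input
stopifnot(count_zeros_full(f) == count_zeros_levels(f))

# Time 1000 repetitions of each (results depend on the machine)
system.time(for (i in 1:1000) count_zeros_full(f))
system.time(for (i in 1:1000) count_zeros_levels(f))
```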

  • If you are working with a very large vector, you can save time by running the regex only on the `levels` of `x`: `sum(nchar(gsub("[^0]", "", levels(x))) * tabulate(x))` (4 upvotes)

Ric*_*ven 6

Here are three options.

Option 1: scan() the vector with sep="|"

sum(scan(text=as.character(x), sep="|") == 0)
# [1] 4

Option 2: gregexpr() with a fixed pattern

sum(unlist(gregexpr("0", x, fixed=TRUE)) > 0)
# [1] 4
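
The `> 0` comparison is needed because `gregexpr()` returns one vector of match positions per element and reports `-1` when an element has no match. A small sketch of what it returns for two elements of `x` (the variable name `m` is just for illustration, and `as.integer()` is used only to drop the match attributes when printing):

```r
x <- factor(c("1|1","1|0","1|1","1|1","0|0","1|1","0|1"))

# One list entry per element, each holding the match positions
m <- gregexpr("0", as.character(x), fixed = TRUE)

as.integer(m[[1]])  # "1|1" has no "0", so the result is -1
# [1] -1

as.integer(m[[5]])  # "0|0" has "0" at positions 1 and 3
# [1] 1 3

# Counting only positive positions skips the -1 placeholders
sum(unlist(m) > 0)
# [1] 4
```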

Option 3: a very simple and fast package-based option with stringr

library(stringr)
sum(str_count(x, fixed("0")))
# [1] 4