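I have a single, very long character vector, i.e. somechars<-c("A","B","C","A"...) (its length is somewhere in the millions).
What is the fastest way to count the total number of occurrences of "A" and "B" in this vector? I have tried grep and lapply, but both take a long time to run.
My current solution is: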
tmp<-table(somechars)
sum(tmp["A"],tmp["B"])
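But this still takes a while to compute. Is there a faster way to do this? Or is there a package that already does this faster? I looked at the stringr package, but it just uses a simple grep.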
I thought this would be the fastest...
sum(somechars %in% c('A', 'B'))
And it is faster than...
sum(c(somechars=="A",somechars=="B"))
But not faster than...
sum(somechars=="A"|somechars=="B")
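But that depends on how many comparisons you make... which brings me back to my first guess. Once you want to sum over more than 2 letters, the %in% version is the fastest.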
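To illustrate why the %in% form stays convenient as the number of letters grows, here is a minimal sketch. The helper name count_in, the seed, and the sample size are assumptions made for this example and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): count how many
# elements of x are equal to any of the values in `targets`.
count_in <- function(x, targets) {
  sum(x %in% targets)
}

set.seed(42)  # arbitrary seed, only so the example is reproducible
somechars <- sample(LETTERS, 1e5, replace = TRUE)

# The call has the same shape whether we look for 2 letters or 4.
n_two  <- count_in(somechars, c("A", "B"))
n_four <- count_in(somechars, c("A", "B", "C", "D"))

# Cross-check against the explicit comparison forms from the answer.
stopifnot(n_two == sum(somechars == "A" | somechars == "B"))
stopifnot(n_four == sum(somechars == "A") + sum(somechars == "B") +
                    sum(somechars == "C") + sum(somechars == "D"))
```

With %in% the vector is scanned once against the whole set of targets, whereas each extra == adds another full pass over the data, which is consistent with the remark above about more than 2 letters.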
Regular expressions are expensive. You can get the result for your problem with exact comparisons.
> somechars <- sample(LETTERS, 5e6, TRUE)
> sum(c(somechars=="A",somechars=="B"))
[1] 385675
> system.time(sum(c(somechars=="A",somechars=="B")))
   user  system elapsed
  0.416   0.072   0.487
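Apart from speed, an unanchored pattern such as "A|B" matches substrings, so it only agrees with exact comparison when every element is a single character, as in the question. The small vector below is made up for illustration and is not the OP's data.

```r
# Illustrative data with multi-character strings (an assumption; the
# question's vector holds single characters only).
x <- c("A", "B", "AB", "CA", "C", "b")

# Unanchored regex: counts every string that contains an A or a B.
regex_unanchored <- sum(grepl("A|B", x))

# Anchored regex: the whole string must be exactly A or exactly B.
regex_anchored <- sum(grepl("^(A|B)$", x))

# Exact comparison, as suggested above.
exact <- sum(x == "A") + sum(x == "B")

print(c(unanchored = regex_unanchored,
        anchored   = regex_anchored,
        exact      = exact))
```

Here the unanchored pattern also counts "AB" and "CA", while the anchored pattern and the exact comparison count only the elements "A" and "B".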
Updated to include timings for the OP's approach and the other answers. Also included is a test case with more than 2 characters.
> library(rbenchmark)
> benchmark( replications=5, order="relative",
+ grep = sum(grepl("A|B",somechars)),
+ table = sum(table(somechars)[c("A","B")]),
+ c = sum(c(somechars=="A",somechars=="B")),
+ OR = sum(somechars=="A"|somechars=="B"),
+ IN = sum(somechars %in% c("A","B")),
+ plus = sum(somechars=="A")+sum(somechars=="B") )
   test replications elapsed relative user.self sys.self user.child sys.child
6  plus            5   4.289 1.000000     3.836    0.436          0         0
3     c            5   4.991 1.163675     4.156    0.804          0         0
5    IN            5   5.480 1.277687     4.549    0.880          0         0
4    OR            5   5.574 1.299604     5.000    0.544          0         0
1  grep            5  16.426 3.829797    16.205    0.172          0         0
2 table            5  17.834 4.158079    12.793    4.884          0         0
>
> benchmark( replications=5, order="relative",
+ grep = sum(grepl("A|B|C|D",somechars)),
+ table = sum(table(somechars)[c("A","B","C","D")]),
+ c = sum(c(somechars=="A",somechars=="B",
+ somechars=="C",somechars=="D")),
+ OR = sum(somechars=="A"|somechars=="B"|
+ somechars=="C"|somechars=="D"),
+ IN = sum(somechars %in% c("A","B","C","D")),
+ plus = sum(somechars=="A")+sum(somechars=="B")+
+ sum(somechars=="C")+sum(somechars=="D") )
   test replications elapsed relative user.self sys.self user.child sys.child
5    IN            5   5.513 1.000000     4.464    1.004          0         0
6  plus            5   8.603 1.560493     7.705    0.860          0         0
3     c            5  10.283 1.865228     8.648    1.560          0         0
4    OR            5  12.348 2.239797    10.849    1.464          0         0
2 table            5  17.960 3.257754    12.877    4.921          0         0
1  grep            5  21.692 3.934700    21.405    0.192          0         0
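Before comparing timings, it can be worth confirming that all six benchmarked expressions return the same count. The following is a rough sketch using only base R. The seed and the smaller sample size are assumptions chosen so it runs quickly, and it does not reproduce the timings above.

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility
somechars <- sample(LETTERS, 1e5, replace = TRUE)

# Same six expressions as in the two-letter benchmark above.
counts <- c(
  grep  = sum(grepl("A|B", somechars)),
  table = sum(table(somechars)[c("A", "B")]),
  c     = sum(c(somechars == "A", somechars == "B")),
  OR    = sum(somechars == "A" | somechars == "B"),
  IN    = sum(somechars %in% c("A", "B")),
  plus  = sum(somechars == "A") + sum(somechars == "B")
)

# Every approach should report the same total on single-character data.
stopifnot(length(unique(counts)) == 1L)
print(counts)
```

Timings like the ones above depend on the machine, the R version, and the data, so it is safer to rerun the benchmark on your own vector than to rely on the exact ratios.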