Fast counting of characters in a character vector

Los*_*Lin 3 r

I have a very long vector of single characters, i.e. somechars<-c("A","B","C","A"...) (its length is somewhere in the millions).

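What is the fastest way to count the total number of occurrences of "A" and "B" in this vector? I have already tried grep and lapply, but both take a very long time to run.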

My current solution is:

tmp<-table(somechars)
sum(tmp["A"],tmp["B"])

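But this still takes a while to compute. Is there a faster way to do this? Or is there a package I could use that already does it faster? I looked at the stringr package, but it uses a simple grep.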
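For readers following along, here is a minimal sketch of what the table-based approach in the question computes. The toy vector is only an assumption standing in for the real data, which has millions of elements. It also hints at why this route tends to be slow: table() tallies every distinct value, even though only two counts are needed.

```r
# Toy stand-in for the real vector (assumption: the real one is far longer).
somechars <- c("A", "B", "C", "A", "D", "B", "A")

# table() counts every distinct value, not just the two of interest.
tmp <- table(somechars)
print(tmp)

# Pick the two counts by name and add them up.
count_ab <- sum(tmp[c("A", "B")])
print(count_ab)
```

Note that if one of the letters never occurs in the vector, indexing the table by that name gives NA, so the sum would be NA as well.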

Joh*_*ohn 9

I think this would be the fastest...

sum(somechars %in% c('A', 'B'))

And it is faster than...

sum(c(somechars=="A",somechars=="B"))

But it is not faster than...

sum(somechars=="A"|somechars=="B")

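But that depends on how many comparisons you make... which brings me back to my first guess. Once you want to sum over more than 2 letters, the %in% version is the fastest.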
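One practical difference between these variants, beyond speed, is how they treat missing values. The sketch below uses a small made-up vector containing an NA (an assumption; the question does not say whether the real data has any) to show that %in% never returns NA, while the == based versions propagate it unless na.rm = TRUE is passed to sum().

```r
# Made-up input with one missing value (assumption for illustration).
somechars <- c("A", "B", "C", NA, "A")
targets <- c("A", "B")

# %in% is built on match(), so an NA element simply yields FALSE.
in_count <- sum(somechars %in% targets)

# == returns NA for the NA element, and sum() then returns NA.
or_count <- sum(somechars == "A" | somechars == "B")

# Dropping NAs explicitly brings the two approaches back in line.
or_count_narm <- sum(somechars == "A" | somechars == "B", na.rm = TRUE)

print(in_count)
print(or_count)
print(or_count_narm)
```

On data without missing values the variants return the same count, so the choice there comes down to speed alone.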


Jos*_*ich 8

Regular expressions are expensive. You can get the result for your problem with exact comparisons.

> somechars <- sample(LETTERS, 5e6, TRUE)
> sum(c(somechars=="A",somechars=="B"))
[1] 385675
> system.time(sum(c(somechars=="A",somechars=="B")))
   user  system elapsed 
  0.416   0.072   0.487 

Updated to include timings for the OP's approach and the other answers. Also includes a test case with more than 2 characters.

> library(rbenchmark)
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B",somechars)),
+   table = sum(table(somechars)[c("A","B")]),
+   c = sum(c(somechars=="A",somechars=="B")),
+   OR = sum(somechars=="A"|somechars=="B"),
+   IN = sum(somechars %in% c("A","B")),
+   plus = sum(somechars=="A")+sum(somechars=="B") )
   test replications elapsed relative user.self sys.self user.child sys.child
6  plus            5   4.289 1.000000     3.836    0.436          0         0
3     c            5   4.991 1.163675     4.156    0.804          0         0
5    IN            5   5.480 1.277687     4.549    0.880          0         0
4    OR            5   5.574 1.299604     5.000    0.544          0         0
1  grep            5  16.426 3.829797    16.205    0.172          0         0
2 table            5  17.834 4.158079    12.793    4.884          0         0
> 
> benchmark( replications=5, order="relative",
+   grep = sum(grepl("A|B|C|D",somechars)),
+   table = sum(table(somechars)[c("A","B","C","D")]),
+   c = sum(c(somechars=="A",somechars=="B",
+             somechars=="C",somechars=="D")),
+   OR = sum(somechars=="A"|somechars=="B"|
+            somechars=="C"|somechars=="D"),
+   IN = sum(somechars %in% c("A","B","C","D")),
+   plus = sum(somechars=="A")+sum(somechars=="B")+
+          sum(somechars=="C")+sum(somechars=="D") )
   test replications elapsed relative user.self sys.self user.child sys.child
5    IN            5   5.513 1.000000     4.464    1.004          0         0
6  plus            5   8.603 1.560493     7.705    0.860          0         0
3     c            5  10.283 1.865228     8.648    1.560          0         0
4    OR            5  12.348 2.239797    10.849    1.464          0         0
2 table            5  17.960 3.257754    12.877    4.921          0         0
1  grep            5  21.692 3.934700    21.405    0.192          0         0
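If rbenchmark is not installed, a rough version of the comparison above can be sketched in base R. The list name approaches, the seed, and the smaller sample size are all assumptions made for illustration. A single system.time() call per approach is much noisier than the replicated benchmark, so the ordering it prints should be treated as indicative only.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable
somechars <- sample(LETTERS, 1e5, TRUE)  # smaller than the 5e6 used above

# Hypothetical wrapper list: the six expressions from the benchmark above.
approaches <- list(
  grep  = function(x) sum(grepl("A|B", x)),
  table = function(x) sum(table(x)[c("A", "B")]),
  c     = function(x) sum(c(x == "A", x == "B")),
  OR    = function(x) sum(x == "A" | x == "B"),
  IN    = function(x) sum(x %in% c("A", "B")),
  plus  = function(x) sum(x == "A") + sum(x == "B")
)

# Check first that every approach returns the same count.
counts <- vapply(approaches, function(f) as.numeric(f(somechars)), numeric(1))
print(counts)
all_agree <- length(unique(counts)) == 1

# Then take one elapsed-time measurement per approach.
elapsed <- vapply(
  approaches,
  function(f) system.time(f(somechars))[["elapsed"]],
  numeric(1)
)
print(sort(elapsed))
```

Checking that the counts agree before timing anything guards against comparing expressions that do not compute the same thing.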

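  • @TomasT. You would make more friends and influence more people if you tried to phrase your comments positively instead of pointing out that an answer is flawed. For example, you could write "Nice answer - it would be faster if you didn't concatenate A and B" (4 upvotes)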