Eti*_*rie 92 regex r dataframe
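I have a data.frame in which some of the variables contain text strings. I want to count the number of occurrences of a given character in each individual string.
Example: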
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
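I want to create a new column in q.data containing the number of occurrences of "a" in each string (i.e. c(2,1,0)).
The only approach I have managed is this convoluted one: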
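string.counter <- function(strings, pattern) {
  counts <- NULL
  # seq_along() avoids iterating over c(1, 0) when strings is empty
  for (i in seq_along(strings)) {
    # match.length is -1 when there is no match, so keep only positive lengths
    counts[i] <- length(attr(gregexpr(pattern, strings[i])[[1]], "match.length")[attr(gregexpr(pattern, strings[i])[[1]], "match.length") > 0])
  }
  return(counts)
}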
q.data$number.of.a <- string.counter(strings = q.data$string, pattern = "a")
q.data
  number     string number.of.a
1      1 greatgreat           2
2      2      magic           1
3      3        not           0
Das*_*son 126
The stringr package provides the str_count function, which seems to do what you're interested in:
# Load your example data
q.data <- data.frame(number = 1:3, string = c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
#   number     string number.of.a
# 1      1 greatgreat           2
# 2      2      magic           1
# 3      3        not           0
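One caveat that may be worth keeping in mind: str_count treats its pattern as a regular expression by default, so counting a character that has a special meaning in regex (such as ".") needs the pattern wrapped in fixed(). A minimal sketch, where the vector `s` is a made-up input rather than the question's data:

```r
library(stringr)

# Hypothetical input, chosen only to show the regex-vs-literal difference
s <- c("a.b.c", "no dots", "...")

# As a regex, "." matches any character, so this counts every character
regex_counts <- str_count(s, ".")

# fixed() requests a literal match, so this counts actual dots
literal_counts <- str_count(s, fixed("."))

regex_counts
literal_counts
```

For a plain letter like "a" the two forms give the same counts, so this only matters for regex metacharacters.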
Jos*_*ien 56
If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
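To see why this one-liner handles strings with no match, it may help to look at the intermediate values. The sketch below assumes a plain character vector and uses illustrative variable names (`m`, `matches`) that are not part of the answer:

```r
x <- c("greatgreat", "magic", "not")

# gregexpr() returns one element per string holding the match positions;
# a string without any match gets the single value -1
m <- gregexpr("a", x)
no_match_marker <- as.integer(m[[3]])

# regmatches() extracts the matched substrings and yields character(0)
# for the -1 case, so the element lengths are the counts
matches <- regmatches(x, m)
counts <- lengths(matches)
counts
```

Counting the raw positions in `m` directly would report 1 for "not", which is why the regmatches() step is needed.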
42-*_*42- 15
nchar(as.character(q.data$string)) - nchar(gsub("a", "", q.data$string))
# [1] 2 1 0
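Note that I coerce the factor variable to character before passing it to nchar. The regex functions appear to do this internally.
Here are the benchmark results (with the test size scaled up to 3000 rows):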
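The length-difference idea can be wrapped in a small helper. This is only a sketch: the name `count_fixed` is hypothetical, it matches the pattern literally via fixed = TRUE (the line above uses a regex), and it divides by the pattern length so that multi-character patterns are counted correctly:

```r
# Hypothetical helper: count literal occurrences of `pattern` in each string
count_fixed <- function(x, pattern) {
  x <- as.character(x)  # nchar() rejects factors, so coerce first
  removed <- gsub(pattern, "", x, fixed = TRUE)
  # Each removed occurrence shortens the string by nchar(pattern) characters
  (nchar(x) - nchar(removed)) / nchar(pattern)
}

strings <- factor(c("greatgreat", "magic", "not"))
count_a <- count_fixed(strings, "a")
count_great <- count_fixed(strings, "great")
count_a
count_great
```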
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
library(rbenchmark)  # provides benchmark()
library(stringr)     # provides str_count()
benchmark(
  Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
  Tim   = { resT <- sapply(as.character(q.data$string), function(x, letter = "a") {
              sum(unlist(strsplit(x, split = "")) == letter) }) },
  DWin  = { resW <- nchar(as.character(q.data$string)) - nchar(gsub("a", "", q.data$string)) },
  # Note: this entry searches for "g" rather than "a"
  Josh  = { x <- sapply(regmatches(q.data$string, gregexpr("g", q.data$string)), length) },
  replications = 100)
#-----------------------
   test replications elapsed  relative user.self sys.self user.child sys.child
1 Dason          100   4.173  9.959427     2.985    1.204          0         0
3  DWin          100   0.419  1.000000     0.417    0.003          0         0
4  Josh          100  18.635 44.474940    17.883    0.827          0         0
2   Tim          100   3.705  8.842482     3.646    0.072          0         0
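Before comparing timings, it can be worth confirming that the candidate approaches return the same counts for the same pattern. A base-R-only sketch under that assumption (the `res_*` names are illustrative, and the stringr variant is left out so the check runs without extra packages):

```r
# Same shape as the benchmark data: the three example strings repeated 1000 times
x <- rep(c("greatgreat", "magic", "not"), 1000)

# Length difference after deleting the character
res_nchar <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))

# Number of matches extracted by regmatches()
res_regmatches <- lengths(regmatches(x, gregexpr("a", x, fixed = TRUE)))

# Split into single characters and count the equal ones
res_split <- vapply(strsplit(x, ""), function(ch) sum(ch == "a"), integer(1))

# All three should agree element by element
stopifnot(all(res_nchar == res_regmatches), all(res_split == res_regmatches))
```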
The stringi package provides the functions stri_count and stri_count_fixed, which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
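As a rough illustration of how the literal and regex variants differ, the sketch below calls stri_count_fixed for a single literal character and stri_count_regex for a character class. The vowel pattern is an invented example, not something from the question:

```r
library(stringi)

x <- c("greatgreat", "magic", "not")

# Literal match: counts the character "a" exactly as written
a_counts <- stri_count_fixed(x, "a")

# Regex match: counts any lowercase vowel (illustrative pattern)
vowel_counts <- stri_count_regex(x, "[aeiou]")

a_counts
vowel_counts
```

The fixed variant skips regex compilation, which is one plausible reason it tends to be the faster choice for single characters.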
Benchmark
Compared with the fastest approach from @42-'s answer and the equivalent function from the stringr package, on a vector with 30,000 elements.
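library(microbenchmark)
library(stringr)   # provides str_count()
library(ggplot2)   # provides the autoplot() generic
# test.data is defined at the end of this answer
benchmark <- microbenchmark(
  stringi = stringi::stri_count(test.data$string, fixed = "a"),
  baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
  stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)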
Data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]