Remove the last two characters of a string if they meet a condition

wak*_*ake 4 regex string r character gsub

I have 2 million names in a database. For example:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"))

> df
             names
1           A ADAM
2           S BEAN
3        A APPLE A
4 A SCHWARZENEGGER

I would like to remove ' A' (a space followed by A) if those are the last two characters of the string.

I know regex is our friend here. How do I efficiently apply a regex function to the last two characters of a string?

Desired output:

> output
             names
1           A ADAM
2           S BEAN
3          A APPLE
4 A SCHWARZENEGGER

bar*_*nus 6

If you want good performance on millions of records, the stringi package is what you need. It even outperforms the base R functions:

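library(stringi)
# Build 10,000 random test strings with lengths 1 to 100 (recycled)
n <- 10000
x <- stri_rand_strings(n, 1:100)
# Append " A" to a random 1% of them
ind <- sample(n, n/100)
x[ind] <- stri_paste(x[ind], " A")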

baseR <- function(x){
  sub("\\sA$", "", x)
}

stri1 <- function(x){
  stri_replace_last_regex(x, "\\sA$","")
}

stri2 <- function(x){
  ind <- stri_detect_regex(x, "\\sA$")
  x[ind] <- stri_sub(x[ind],1, -3)
  x
}

# If we assume the separator is always a plain space, not any whitespace
# character, a fixed-string check is faster still (roughly 200x faster than baseR)
stri3 <- function(x){
  ind <- stri_endswith_fixed(x, " A")
  x[ind] <- stri_sub(x[ind],1, -3)
  x
}


head(stri2(x),44)
require(microbenchmark)
microbenchmark(baseR(x), stri1(x),stri2(x),stri3(x))
Unit: microseconds
     expr        min        lq        mean      median         uq        max neval
 baseR(x) 166044.032 172054.30 183919.6684 183112.1765 194586.231 219207.905   100
 stri1(x)  36704.180  39015.59  41836.8612  40164.9365  43773.034  60373.866   100
 stri2(x)  17736.535  18884.56  20575.3306  19818.2895  21759.489  31846.582   100
 stri3(x)    491.963    802.27    918.1626    868.9935   1008.776   2489.923   100
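For readers who cannot install stringi, the same idea as stri3 (a fixed-string suffix check followed by trimming) can be sketched in base R with `endsWith` (available since R 3.3.0) and `substr`. The helper name `strip_suffix_base` is hypothetical and not part of the original answer, and it was not included in the benchmark above, so its speed relative to stri3 is untested here.

```r
# Hypothetical base-R analogue of stri3; not from the original answer.
strip_suffix_base <- function(x, suffix = " A") {
  x <- as.character(x)                    # data.frame columns may be factors
  hit <- !is.na(x) & endsWith(x, suffix)  # TRUE where the string ends with the suffix
  if (any(hit)) {
    # Keep everything except the trailing suffix characters
    x[hit] <- substr(x[hit], 1, nchar(x[hit]) - nchar(suffix))
  }
  x
}

strip_suffix_base(c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"))
#[1] "A ADAM"           "S BEAN"           "A APPLE"          "A SCHWARZENEGGER"
```

Like stri3, this only matches a literal space before the A, so a tab or other whitespace character would not be stripped.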


akr*_*run 5

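We can use `sub` to match a whitespace character (`\\s`) followed by 'A' at the end of the string (`$`) and replace it with an empty string (`""`).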

df$names <- sub("\\sA$", "", df$names)
df$names
#[1] "A ADAM"           "S BEAN"           "A APPLE"          "A SCHWARZENEGGER"
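To make the role of each part of the pattern concrete, here is a small sketch on made-up inputs (the vector `x` and the variable names are illustrative only). It shows that `$` restricts the match to the end of the string, and that `\\s` accepts any whitespace character while a literal space does not.

```r
# Illustrative inputs, not from the original question
x <- c("A ADAM", "A APPLE A", "A A B", "TAB\tA")

# \\s matches any whitespace character; $ anchors the match to the end
any_ws <- sub("\\sA$", "", x)
any_ws
#[1] "A ADAM"  "A APPLE" "A A B"   "TAB"

# A literal space leaves the tab-separated case untouched
space_only <- sub(" A$", "", x)
space_only
#[1] "A ADAM"  "A APPLE" "A A B"   "TAB\tA"
```

The " A" inside "A A B" is left alone in both cases because it is not at the end of the string.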

  • A comment on efficiency: with 2 million rows, timing the execution with `df = data.frame(names = rep(c("A ADAM","S BEAN","A APPLE A","SCHWARZENEGGER"), length.out = 2000000)); curr.time = proc.time(); df$names = sub("\\sA$","",df$names); proc.time() - curr.time; rm(curr.time)` takes 1.3 seconds. I thought it might be slower with strings rather than factors, i.e. `data.frame(names, stringsAsFactors = F)`, but the timing is comparable. This is more than efficient enough, IMO. (2 upvotes)