Splitting a variable-length string in a data.table

Question

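I want to create a column from part of the string in another column.

The reference column follows this general format: GB / Ling 31st Dec

In this case I want to extract the word "Ling", which varies in length from row to row.

My approach so far: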

library(data.table)
d1 <- data.table(MENU_HINT = 
                 c("GB / Ling 31st Dec", "GB / Taun 30th Dec", 
                   "GB / Ayr 19th Dec", "GB / Ayr 9th Nov", 
                   "GB / ChelmC 29th Sep"), 
             Track = c("Ling", "Taun", "Ayr", "Ayr", "ChelmC"))

#remove all the spaces
d1[, Track2 := gsub("[[:space:]]", "", MENU_HINT)]

# get the position of the first digit
d1[, x := as.numeric(regexpr("[[:digit:]]", Track2)[[1]])]

# get the position of the '/'
d1[, y := as.numeric(regexpr("/", Track2))[[1]]]

# use above to extract the Track
d1[, Track2 := substr(Track2, y + 1, x - 1)]


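Track is what I expect to get, and Track2 is what the code above actually produces.

This seems cumbersome, and it also does not work, because the x and y values are the same for every row of the column.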
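
A likely cause of the identical x and y values: `regexpr()` already returns one position per row, and the `[[1]]` keeps only the first row's position, which `:=` then recycles down the column. Below is a minimal sketch of the same approach with that indexing dropped. The scratch column name `tmp` is an assumption, introduced so the result column is not overwritten midway, and the sketch assumes track names contain no digits.

```r
library(data.table)

d1 <- data.table(MENU_HINT =
                 c("GB / Ling 31st Dec", "GB / Taun 30th Dec",
                   "GB / Ayr 19th Dec", "GB / Ayr 9th Nov",
                   "GB / ChelmC 29th Sep"))

# remove all the spaces into a scratch column (hypothetical name `tmp`)
d1[, tmp := gsub("[[:space:]]", "", MENU_HINT)]

# regexpr() is vectorised: keep the whole vector of positions, no [[1]]
d1[, x := as.integer(regexpr("[[:digit:]]", tmp))]   # first digit, per row
d1[, y := as.integer(regexpr("/", tmp, fixed = TRUE))] # the '/', per row

# extract the text between the '/' and the first digit
d1[, Track2 := substr(tmp, y + 1L, x - 1L)]

# drop the helper columns
d1[, c("tmp", "x", "y") := NULL]
```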

Answer 1

Dav*_*urg 5

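I wouldn't use regex for this, since it is not efficient on a large data set. The word you are looking for always seems to come after the second space. A very simple and efficient solution is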

d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]]
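
To see why the third piece is the one wanted, here is a small sketch of what the split produces. The vector name `hints` is an assumption for illustration; `keep` is an argument of `tstrsplit()` that retains only the listed pieces, as an alternative to indexing the full result with `[[3]]`.

```r
library(data.table)

hints <- c("GB / Ling 31st Dec", "GB / ChelmC 29th Sep")

# splitting one string on a literal space gives five pieces:
# "GB", "/", "Ling", "31st", "Dec"
pieces <- strsplit(hints[1], " ", fixed = TRUE)[[1]]

# tstrsplit() splits every string and transposes the result, so each
# list element holds the n-th piece of all rows; keep = 3L retains
# only the third piece (the track name)
track <- tstrsplit(hints, " ", fixed = TRUE, keep = 3L)[[1]]
```

This relies on every row having exactly the `GB / <track> <day> <month>` layout; a track name containing a space would be cut at that space.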


Benchmark

bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
                               "gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
                               "tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
#        expr       min        lq      mean    median        uq      max neval
#       sub:   982.1185  998.6264 1058.1576 1025.8775 1083.1613 1405.051   100
#      gsub:  1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027   100
# tstrsplit:   385.4785  452.6476  498.8681  470.8281  537.5499 1044.691   100
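
Before comparing timings, it is worth confirming that the three expressions return the same values. A small check on the five sample strings from the question, using only the calls that appear in the benchmark (the variable names are assumptions for illustration):

```r
library(data.table)

hints <- c("GB / Ling 31st Dec", "GB / Taun 30th Dec",
           "GB / Ayr 19th Dec", "GB / Ayr 9th Nov",
           "GB / ChelmC 29th Sep")

# the three candidate extractions, exactly as benchmarked
by_sub   <- sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", hints)
by_gsub  <- gsub("^[^/]+/\\s*|\\s+.*$", "", hints)
by_split <- tstrsplit(hints, " ", fixed = TRUE)[[3]]

# TRUE only if all three agree on every row
all_agree <- identical(by_sub, by_gsub) && identical(by_sub, by_split)
```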


Archived:	9 years ago
Views:	148
Last active:	9 years ago