Subsetting rows from a data.table using grep, comparing row contents

Question

Oze*_*uss 2 grep r string-matching data.table

library(data.table)

DT <- data.table(num = c("20031111", "1112003", "23423", "2222004"),
                 y   = c("2003", "2003", "2003", "2004"))

> DT
        num    y
1: 20031111 2003
2:  1112003 2003
3:    23423 2003
4:  2222004 2004

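I would like to compare the contents of two cells and perform an action based on the resulting boolean. For instance, if "num" contains the year in "y", create a column x holding that value. I considered subsetting with grep, which works, but it naturally checks the whole column every time, which seems wasteful.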

DT[grep(y, num)]  # runs, but warns: argument 'pattern' has length > 1 and only the first element will be used
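
The warning matters more than it may seem. As a minimal sketch on the example table above (the names `first_only` and `same_as_first` are made up for illustration): `grep()` accepts a single pattern, so only `y[1]` is ever searched for, and a row whose own year differs from the first one is silently missed.

```r
library(data.table)

DT <- data.table(num = c("20031111", "1112003", "23423", "2222004"),
                 y   = c("2003", "2003", "2003", "2004"))

# grep() takes one pattern; given a vector it warns and uses y[1] only.
first_only <- suppressWarnings(grep(DT$y, DT$num))

# The result is the same as searching for "2003" alone, so row 4
# (num = "2222004", y = "2004") is not returned even though it matches its own y.
same_as_first <- identical(first_only, grep(DT$y[1], DT$num))
```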

I could apply() my way through it, but maybe there is a data.table way?
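
For comparison, one possible reading of the "apply" route is a row-by-row check in base R. This is only a sketch under that assumption, not necessarily what the asker had in mind; `hit` is a name introduced here for illustration.

```r
library(data.table)

DT <- data.table(num = c("20031111", "1112003", "23423", "2222004"),
                 y   = c("2003", "2003", "2003", "2004"))

# For each row i, test whether num[i] contains y[i] as a literal substring.
# mapply() passes y[i] as grepl()'s pattern and num[i] as its x argument.
hit <- mapply(grepl, DT$y, DT$num,
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Assign by reference only on the matching rows; the others get NA in x.
DT[hit, x := num]
```

This calls `grepl()` once per row, so it is likely to be slow on large tables, which is the motivation for a vectorised alternative.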

Thanks

Answer 1

Nic*_*edy 5

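If you are happy to use the stringi package, here is an approach that takes advantage of the fact that stringi functions are vectorised over both the pattern and the string: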

DT[stri_detect_fixed(num, y), x := num]
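
As a self-contained sketch, here is the same call applied to the example table from the question. It assumes the data.table and stringi packages are installed; nothing else is needed.

```r
library(data.table)
library(stringi)

DT <- data.table(num = c("20031111", "1112003", "23423", "2222004"),
                 y   = c("2003", "2003", "2003", "2004"))

# stri_detect_fixed(str, pattern) compares element-wise: num[i] against y[i].
# Rows where num contains y get x = num; the remaining rows get NA.
DT[stri_detect_fixed(num, y), x := num]
```

Here rows 1, 2 and 4 receive a value in x, while row 3 ("23423" does not contain "2003") is left as NA.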

Depending on the data, it may be faster than the method posted by Veerenda Gadekar.

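library(data.table)
library(stringi)
library(microbenchmark)

# 1000 rows: a random number with a random year appended, plus a random year to look for
DT <- data.table(num=paste0(sample(1000), sample(2001:2010, 1000, TRUE)),
                 y=as.character(sample(2001:2010, 1000, TRUE)))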
microbenchmark(
    vg = DT[, x := grep(y, num, value=TRUE, fixed=TRUE), by = .(num, y)],
    nk = DT[stri_detect_fixed(num, y), x := num]
)

#Unit: microseconds
# expr      min       lq     mean   median       uq      max neval
#   vg 6027.674 6176.397 6513.860 6278.689 6370.789 9590.398   100
#   nk  975.260 1007.591 1116.594 1047.334 1110.734 3833.051   100

Archived: 10 years, 2 months ago
Views: 1506
Last activity: 9 years, 10 months ago