String comparison on big data using data.table, grepl, or a similar method

Hac*_*k-R 4 r data.table

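I need to check, for every row, whether the string in one column contains the corresponding (numeric) value from the same row of another column.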

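If I were only checking the strings against a single pattern, this would be straightforward with data.table's like or with grepl. However, my pattern value is different for each row.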

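There is a somewhat related question here, but unlike that question, I need to create a logical flag indicating whether the pattern is present.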

Let's say this is my data set:

library(data.table)

DT <- structure(list(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse", "admin",
               "truck", "nurse", "truck")),
  .Names = c("category", "industry"), class = "data.frame",
  row.names = c(NA, -9L))
setDT(DT)  # convert the data.frame to a data.table by reference
> DT
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

The result I want would be a vector like this:

> DT
   matches
1: TRUE
2: FALSE
3: TRUE
4: TRUE
5: FALSE
6: FALSE
7: TRUE
8: TRUE
9: FALSE

Of course, 1 and 0 would be just as good as TRUE and FALSE.

Here are some things I tried that did not work:

apply(DT,1,grepl, pattern = DT[,2], x = DT[,1])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> apply(DT,1,grepl, pattern = DT[,1], x = DT[,2])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> grepl(DT[,2], DT[,1])
[1] FALSE

> DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> DT[stringi::stri_detect_fixed(category, industry)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1]))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1], fixed = T))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> DT[category %like% industry]
         category industry
1: administration    admin
2: administration    admin
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used

akr*_*run 6

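In the OP's code, the comma (,) was not used. Because of that, data.table treats the expression as the i index and subsets the rows for which it is TRUE.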

However, if we include the comma, the expression is evaluated in j and we get the logical vector:

DT[, stringi::stri_detect_fixed(category, industry)]
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
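
If stringi is not available, the same row-wise check can be sketched in base R. This is only an illustration, not part of the original answer: it pairs each pattern with its own string through mapply(), because grepl() by itself uses only the first element of a multi-element pattern. The variable name matches is a hypothetical choice, and the data are the ones from the question.

```r
library(data.table)

# Data from the question, rebuilt directly as a data.table
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Call grepl(pattern = industry[i], x = category[i], fixed = TRUE) once per row.
# USE.NAMES = FALSE keeps the result an unnamed logical vector.
matches <- DT[, mapply(grepl, industry, category,
                       MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)]
print(matches)
```

Since mapply() calls grepl() once per row, this is likely to be slower on large tables than the vectorized stri_detect_fixed() call.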

If we wrap it in a list, we get a data.table with a single column:

DT[, list(match = stringi::stri_detect_fixed(category, industry))]
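
Since the question asks for a logical flag, one possible variation is to store the result as a new column by reference with :=. The sketch below assumes the stringi package is installed; the column names matches and matches_int are hypothetical, and the table is the first three rows of the question's data.

```r
library(data.table)
library(stringi)

# First three rows of the question's data
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking"),
  industry = c("admin", "truck", "truck")
)

# stri_detect_fixed() is vectorized over both the strings and the patterns,
# so row i of category is checked against row i of industry
DT[, matches := stri_detect_fixed(category, industry)]

# Optional 1/0 version of the same flag
DT[, matches_int := as.integer(matches)]
print(DT)
```

Using := adds the column in place without copying the table, which tends to matter on large data.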