Split a column into separate columns using a regular expression

mdk*_*mdk 1 regex split r data.table

I have a column that contains the following information:

1 x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo
2 x=oljhh88999\ty=lop9876666
3 x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid

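So no id type has a fixed number of occurrences in a row. The entries are all separated by tabs. I am looking for a way to get the IDs of each row as separate columns. I found tstrsplit in data.table, but could not figure out how to use it with multiple split arguments. Any ideas?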

Edit: the expected format is:

   x1          x2  y1          y2          z1                        z2
1  abc1000000  NA  pqr2000000  NA          olk78fgzu_zuii8999_ikooo  NA
2  oljhh88999  NA  lop9876666  NA          NA                        NA
3  frdt876544  NA  ztr6u76532  uzrt899963  wertttts_765342_ioooosww  tzuuuee_66554422_88uuiiid

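Note that the ids do not contain "id" in their names, so I updated the example accordingly. An id may occur several times in a row. The format given above is only an example to make the question clearer. In reality, an ID of type X, for example, can occur 20 times in a row. The number of columns for X would then be the maximum number of occurrences of that ID type in any row of the whole dataset. The data is very large: we are talking about roughly 30 million lines.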

Jaa*_*aap 6

New answer:

For the updated example, you can approach the problem as follows:

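# Long format: one row per key=value pair, with a counter for repeated ids
dt2 <- dt[, rn := .I                                        # remember the source row
          ][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn  # one piece per tab-separated field
            ][, c('id','value') := tstrsplit(V1, '=')       # split each piece into id and value
              ][, idn := seq_len(.N), by = .(rn, id)]       # number repeated ids within a row

# Wide format: one column per id/occurrence combination (x1, y1, y2, ...)
dcast(dt2, rn ~ id + idn, value.var = 'value', sep = '')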

This gives:

   rn         x1         y1         y2                       z1                        z2
1:  1 abc1000000 pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 lop9876666         NA                       NA                        NA
3:  3 frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
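
To make the intermediate long table easier to picture, here is a minimal base R sketch of the same three steps (split on tabs, split at "=", number repeated ids within a row). It is not the data.table code from the answer; the names `pieces` and `long` are illustrative only, and it is written for readability rather than for a dataset of the size mentioned in the question.

```r
# The three example rows from the question (tab-separated key=value pairs).
x <- c("x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo",
       "x=oljhh88999\ty=lop9876666",
       "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid")

# Step 1: split each row on tabs and record which row each piece came from.
pieces <- strsplit(x, "\t", fixed = TRUE)
long <- data.frame(rn = rep(seq_along(pieces), lengths(pieces)),
                   kv = unlist(pieces),
                   stringsAsFactors = FALSE)

# Step 2: split each piece at the first "=" into an id and a value.
long$id    <- sub("=.*$", "", long$kv)
long$value <- sub("^[^=]*=", "", long$kv)

# Step 3: number repeated ids within a row (1, 2, ...), like idn in the answer.
long$idn <- ave(seq_along(long$id), long$rn, long$id, FUN = seq_along)

long[, c("rn", "id", "idn", "value")]
```

Each (rn, id, idn) combination identifies exactly one value, which is what allows the long table to be cast to wide format afterwards.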

To get the exact expected output (so including an x2 column as well), you can do:

dcast(dt2[CJ(rn = rn, id = id, idn = idn, unique = TRUE), on = .(rn, id, idn)], 
      rn ~ id + idn, value.var = 'value', sep = '')

This gives:

   rn         x1 x2         y1         y2                       z1                        z2
1:  1 abc1000000 NA pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 NA lop9876666         NA                       NA                        NA
3:  3 frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid

Used data:

dt <- fread('"x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo"
             "x=oljhh88999\ty=lop9876666"
             "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"',
            header=FALSE)
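
The cross join with CJ in the answer adds every combination of rn, id and idn before casting, so combinations that never occur (such as x2) still show up as all-NA columns. The base R sketch below illustrates only that padding idea on a hand-written long table; the names `long`, `cols` and `wide` are hypothetical, and this is not meant as a replacement for the data.table code.

```r
# Hand-written long table matching the example (rn = row, idn = occurrence number).
long <- data.frame(
  rn    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  id    = c("x", "y", "z", "x", "y", "x", "y", "y", "z", "z"),
  idn   = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2),
  value = c("abc1000000", "pqr2000000", "olk78fgzu_zuii8999_ikooo",
            "oljhh88999", "lop9876666",
            "frdt876544", "ztr6u76532", "uzrt899963",
            "wertttts_765342_ioooosww", "tzuuuee_66554422_88uuiiid"),
  stringsAsFactors = FALSE
)

# Every id gets columns 1..max(idn), whether or not that combination occurs.
ids   <- sort(unique(long$id))
n_max <- max(long$idn)
cols  <- paste0(rep(ids, each = n_max), seq_len(n_max))

# Start from an all-NA matrix and fill in only the observed (row, column) cells.
wide <- matrix(NA_character_, nrow = max(long$rn), ncol = length(cols),
               dimnames = list(NULL, cols))
wide[cbind(long$rn, match(paste0(long$id, long$idn), cols))] <- long$value
wide
```

Because the column set is fixed before any values are filled in, the x2 column exists even though no row has a second x.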

Answer to the original question:

If you want to use tstrsplit, you can do it as follows:

dt[, rn := .I
   ][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn
     ][, .(rn, id = gsub('([a-z0-9]+)(=.*$)','\\1',V1))]

This gives:

   rn   id
1:  1 xid1
2:  1 yid2
3:  1 zid3
4:  2 xid4
5:  2 yid5
6:  3 xid6
7:  3 yid7
8:  3 yid8
9:  3 zid9
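
The gsub call above keeps only the first capture group, that is, everything before the "=". A small base R sketch of what the pattern does on individual pieces (the vector `tokens` is made up for illustration from the example data):

```r
# Three key=value pieces as they look after splitting a row on tabs.
tokens <- c("xid1=abc1000000", "yid2=pqr2000000", "zid3=olk78fgzu_zuii8999_ikooo")

# Group 1 is the id, group 2 is "=" plus the value; "\\1" keeps only group 1.
ids <- gsub("([a-z0-9]+)(=.*$)", "\\1", tokens)

# For pieces of this shape, dropping everything from "=" onwards gives the same ids.
ids_alt <- sub("=.*$", "", tokens)

ids
```

Note that the character class `[a-z0-9]` only covers lower-case letters and digits, so ids containing other characters would need a wider class.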

Another alternative, which results in wide-format output:

dt[, tstrsplit(V1, '\t')
   ][, lapply(.SD, gsub, pattern = '([a-z0-9]+)(=.*$)', replacement = '\\1')]

This gives:

     V1   V2   V3   V4
1: xid1 yid2 zid3   NA
2: xid4 yid5   NA   NA
3: xid6 yid7 yid8 zid9

If you want to extract all the ids as @UweBlock did, you could also do the following (a somewhat simpler variant of UweBlock's approach):

l <- regmatches(dt$V1, gregexpr('([a-z]{1}id[0-9]{1})',dt$V1))
l <- lapply(l, as.data.frame.list)
l <- lapply(l, function(x) {names(x) <- paste0('v',seq_along(x)); as.data.table(x)})

rbindlist(l, fill = TRUE)

This gives:

     v1   v2   v3   v4   v5
1: xid1 yid2 zid3   NA   NA
2: xid4 yid5   NA   NA   NA
3: xid6 yid7 yid8 zid8 zid9

Used data:

dt <- fread('"xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo"
"xid4=oljhh88999\tyid5=lop9876666"
"xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid"',header=FALSE)
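
The regmatches approach returns a ragged list (a different number of ids per row), which then has to be padded into a rectangle. A minimal base R sketch of that padding step, assuming a plain character matrix is acceptable instead of a data.table; the names `m`, `n_max` and `wide` are illustrative. The third string is reproduced as in the thread, where `uzrt899963tzid8` has no tab before `zid8`, which is why zid8 is found by pattern matching but not by splitting on tabs.

```r
# Example rows from the original question, copied as given in the thread.
x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
       "xid4=oljhh88999\tyid5=lop9876666",
       "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")

# All matches of "<letter>id<digit>" per row; the result is a list of character vectors.
m <- regmatches(x, gregexpr("[a-z]id[0-9]", x))

# Pad every vector with NA up to the longest one, then bind the rows together.
n_max <- max(lengths(m))
wide <- t(sapply(m, function(v) c(v, rep(NA_character_, n_max - length(v)))))
colnames(wide) <- paste0("v", seq_len(n_max))
wide
```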


Rom*_*rik 5

You did not specify what the output should look like. To beat akrun to an answer, here is a list in which each element represents one of your rows.

In this solution, each row is split on tabs and the pattern [xyz]id[integer] is extracted from every piece.

x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
       "xid4=oljhh88999\tyid5=lop9876666",
       "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")

res <- sapply(x, FUN = function(m) {
  m <- strsplit(m, "\t")
  out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o), 
         simplify = FALSE)
  out
  }, simplify = FALSE)

res <- unname(res)
res

[[1]]
[[1]][[1]]
[1] "xid1" "yid2" "zid3"


[[2]]
[[2]][[1]]
[1] "xid4" "yid5"


[[3]]
[[3]][[1]]
[1] "xid6" "yid7" "yid8" "zid9"

If you omit simplify = FALSE and do not unname the result, you get:

$`xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo`
     [,1]  
[1,] "xid1"
[2,] "yid2"
[3,] "zid3"

$`xid4=oljhh88999\tyid5=lop9876666`
     [,1]  
[1,] "xid4"
[2,] "yid5"

$`xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid`
     [,1]  
[1,] "xid6"
[2,] "yid7"
[3,] "yid8"
[4,] "zid9"

If you do not care which row each element came from, you can do:

rapply(as.list(x), f = function(m){
  m <- strsplit(m, "\t")
  out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o), 
                simplify = FALSE)
})

[1] "xid1" "yid2" "zid3" "xid4" "yid5" "xid6" "yid7" "yid8" "zid9"

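But even then, the origin of each element can be reconstructed with the help of the first solution (by counting the number of elements in each list entry).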
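
As a sketch of that reconstruction, assuming the ids are extracted per row first: the number of elements per row is enough to label a flat vector with its source row and to split it back. The names `ids_by_row`, `flat` and `origin` are illustrative, and `sub()` is used here in place of the `gsub()` pattern above for brevity.

```r
# Example rows from the original question, copied as given in the thread.
x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
       "xid4=oljhh88999\tyid5=lop9876666",
       "xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")

# Per-row ids: split on tabs, then drop everything from the first "=" onwards.
ids_by_row <- lapply(strsplit(x, "\t", fixed = TRUE),
                     function(p) sub("=.*$", "", p))

# Flat vector of ids, plus the row index each id came from (from the element counts).
flat   <- unlist(ids_by_row)
origin <- rep(seq_along(ids_by_row), lengths(ids_by_row))

# Splitting the flat vector by its origin rebuilds the per-row list.
rebuilt <- unname(split(flat, origin))
data.frame(row = origin, id = flat)
```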