mdk*_*mdk 1 regex split r data.table
I have a column containing the following information:
1 x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo
2 x=oljhh88999\ty=lop9876666
3 x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid
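So no id type has a fixed number of occurrences per row. They are all separated by tabs. I am looking for a way to get the ids of each row as separate columns, and found tstrsplit in data.table, but could not figure out how to use it with multiple split arguments. Any ideas?
Edit: the expected format is: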
          x1 x2         y1         y2                       z1                        z2
1 abc1000000 NA pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2 oljhh88999 NA lop9876666         NA                       NA                        NA
3 frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
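Note that the ids do not contain "id" in their names, so I updated the example accordingly. An id can occur multiple times in each row. The format given above is just an example to make the problem clearer. In reality, an id of type x, for example, can occur 20 times in a row. The number of columns for x would then be the maximum number of occurrences of that id type in any row of the whole dataset. The data is very large: we are talking about roughly 30 million rows.
New answer:
For the updated example, you can solve the problem as follows: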
dt2 <- dt[, rn := .I
][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn
][, c('id','value') := tstrsplit(V1, '=')
][, idn := 1:.N, by = .(rn, id)]
dcast(dt2, rn ~ id + idn, value.var = 'value', sep = '')
This results in:
   rn         x1         y1         y2                       z1                        z2
1:  1 abc1000000 pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 lop9876666         NA                       NA                        NA
3:  3 frdt876544 ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
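To make the chain easier to follow, here is a minimal sketch that runs the same steps one at a time and prints the intermediate long table before it is reshaped. The name `long` and the column name `pair` are illustrative choices, not part of the answer, and `seq_len(.N)` is used in place of `1:.N`.

```r
library(data.table)

# Same three example rows as in the question, built directly as a data.table
dt <- data.table(V1 = c(
  "x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo",
  "x=oljhh88999\ty=lop9876666",
  "x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"
))

# Step 1: remember the original row number
dt[, rn := .I]

# Step 2: one row per "id=value" pair (split on the tab character)
long <- dt[, .(pair = unlist(tstrsplit(V1, "\t"))), by = rn]

# Step 3: split each pair into the id type and its value
long[, c("id", "value") := tstrsplit(pair, "=")]

# Step 4: number repeated ids within a row, so a second y becomes y2
long[, idn := seq_len(.N), by = .(rn, id)]

print(long[, .(rn, id, idn, value)])
```

Each row of `long` holds one id occurrence, which is the shape dcast needs to spread `id` and `idn` into columns.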
To get the exact output (so including an x2 column as well), you can do:
dcast(dt2[CJ(rn = rn, id = id, idn = idn, unique = TRUE), on = .(rn, id, idn)],
rn ~ id + idn, value.var = 'value', sep = '')
This results in:
   rn         x1 x2         y1         y2                       z1                        z2
1:  1 abc1000000 NA pqr2000000         NA olk78fgzu_zuii8999_ikooo                        NA
2:  2 oljhh88999 NA lop9876666         NA                       NA                        NA
3:  3 frdt876544 NA ztr6u76532 uzrt899963 wertttts_765342_ioooosww tzuuuee_66554422_88uuiiid
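The x2 column appears because CJ builds every combination of the observed rn, id and idn values, and the join fills the combinations that never occur with NA. A small sketch on a hypothetical toy table (the names `long`, `grid`, `full` and `wide` and the values are made up for illustration) shows the effect in isolation:

```r
library(data.table)

# Hypothetical toy long table: row 1 has one x, row 2 has two y's
long <- data.table(
  rn    = c(1L, 2L, 2L),
  id    = c("x", "y", "y"),
  idn   = c(1L, 1L, 2L),
  value = c("a", "b", "c")
)

# All combinations of the observed rn, id and idn values (2 x 2 x 2 = 8 rows)
grid <- CJ(rn = long$rn, id = long$id, idn = long$idn, unique = TRUE)

# Join the long table onto the grid; combinations without a match get NA
full <- long[grid, on = .(rn, id, idn)]

# Reshape to wide; x2 now exists even though no row has a second x
wide <- dcast(full, rn ~ id + idn, value.var = "value", sep = "")
print(wide)
```

Note that the grid grows with the product of the number of rows, id types and maximum occurrence count, which may matter for a dataset with tens of millions of rows.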
Used data:
dt <- fread('"x=abc1000000\ty=pqr2000000\tz=olk78fgzu_zuii8999_ikooo"
"x=oljhh88999\ty=lop9876666"
"x=frdt876544\ty=ztr6u76532\ty=uzrt899963\tz=wertttts_765342_ioooosww\tz=tzuuuee_66554422_88uuiiid"',
header=FALSE)
Answer to the original question:
If you want to use tstrsplit, you can do it as follows:
dt[, rn := .I
][, .(V1 = unlist(tstrsplit(V1, '\t'))), by = rn
][, .(rn, id = gsub('([a-z0-9]+)(=.*$)','\\1',V1))]
This results in:
   rn   id
1:  1 xid1
2:  1 yid2
3:  1 zid3
4:  2 xid4
5:  2 yid5
6:  3 xid6
7:  3 yid7
8:  3 yid8
9:  3 zid9
Another alternative, which gives output in wide format:
dt[, tstrsplit(V1, '\t'),
][, lapply(.SD, gsub, pattern = '([a-z0-9]+)(=.*$)', replacement = '\\1')]
This results in:
     V1   V2   V3   V4
1: xid1 yid2 zid3   NA
2: xid4 yid5   NA   NA
3: xid6 yid7 yid8 zid9
If you want to extract all the ids as @UweBlock does, you can also do the following (though it is a little simpler than UweBlock's approach):
l <- regmatches(dt$V1, gregexpr('([a-z]{1}id[0-9]{1})',dt$V1))
l <- lapply(l, as.data.frame.list)
l <- lapply(l, function(x) {names(x) <- paste0('v',seq_along(x)); as.data.table(x)})
rbindlist(l, fill = TRUE)
This results in:
     v1   v2   v3   v4   v5
1: xid1 yid2 zid3   NA   NA
2: xid4 yid5   NA   NA   NA
3: xid6 yid7 yid8 zid8 zid9
Used data:
dt <- fread('"xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo"
"xid4=oljhh88999\tyid5=lop9876666"
"xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid"',header=FALSE)
You did not specify what the output should look like. To beat akrun to the answer, here is a list whose elements represent your rows.
In this solution, you split each row on the tab character and look for the pattern [xyz]id[integer].
x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
"xid4=oljhh88999\tyid5=lop9876666",
"xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid")
res <- sapply(x, FUN = function(m) {
m <- strsplit(m, "\t")
out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o),
simplify = FALSE)
out
}, simplify = FALSE)
res <- unname(res)
res
[[1]]
[[1]][[1]]
[1] "xid1" "yid2" "zid3"
[[2]]
[[2]][[1]]
[1] "xid4" "yid5"
[[3]]
[[3]][[1]]
[1] "xid6" "yid7" "yid8" "zid9"
If you omit simplify = FALSE and do not unname the result, you get
$`xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo`
[,1]
[1,] "xid1"
[2,] "yid2"
[3,] "zid3"
$`xid4=oljhh88999\tyid5=lop9876666`
[,1]
[1,] "xid4"
[2,] "yid5"
$`xid6=frdt876544\tyid7=ztr6u76532\tyid8=uzrt899963tzid8=wertttts_765342_ioooosww\tzid9=tzuuuee_66554422_88uuiiid`
[,1]
[1,] "xid6"
[2,] "yid7"
[3,] "yid8"
[4,] "zid9"
If you do not care which row each element came from, you can do
rapply(as.list(x), f = function(m){
m <- strsplit(m, "\t")
out <- sapply(m, FUN = function(o) gsub(pattern = "(^[[:alpha:]]id\\d+)(=.*$)", replacement = "\\1", x = o),
simplify = FALSE)
})
[1] "xid1" "yid2" "zid3" "xid4" "yid5" "xid6" "yid7" "yid8" "zid9"
But even then, the origin of each element can be reconstructed with the help of the first solution, by counting the number of elements in each list.
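As a rough sketch of that reconstruction: count the ids per row with lengths() and repeat each row index that many times. The names `ids_by_row`, `flat` and `origin` are illustrative, only the first two example rows are used, and the ids are extracted with a simplified `sub("=.*$", "", ...)` rather than the regex from the answer.

```r
x <- c("xid1=abc1000000\tyid2=pqr2000000\tzid3=olk78fgzu_zuii8999_ikooo",
       "xid4=oljhh88999\tyid5=lop9876666")

# Per-row list of ids: split on tab, then drop everything from "=" onwards
ids_by_row <- lapply(strsplit(x, "\t", fixed = TRUE),
                     function(p) sub("=.*$", "", p))

# Flat vector of ids, comparable to the rapply() result
flat <- unlist(ids_by_row)

# Count the ids per row and repeat each row index that many times
origin <- rep(seq_along(ids_by_row), times = lengths(ids_by_row))

res <- data.frame(row = origin, id = flat)
print(res)
```

The `row` column then tells you which input row each id in the flat vector belongs to.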