Dmi*_*kin 0 regex dictionary r data.table
I'm looking for a fast, efficient way to expand a dictionary (df1):
                pattern cat1 cat2
1         I want [food]    a    b
2 I'm [amplifier] [pos]    c    d
df1 <- data.frame(pattern=c("I want [food]", "I'm [amplifier] [pos]"),
                  cat1=c("a", "c"), cat2=c("b", "d"), stringsAsFactors=FALSE)
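It holds string patterns in which some categories are enclosed in square brackets []. These refer to categories listed in a second data frame in dictionary format (df2).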
     pattern  category
1      pizza      food
2    hot dog      food
3      chips      food
4       very amplifier
5  very much amplifier
6      happy       pos
7 optimistic       pos
df2 <- structure(list(pattern = c("pizza", "hot dog", "chips", "very",
"very much", "happy", "optimistic"), category = c("food", "food",
"food", "amplifier", "amplifier", "pos", "pos")), .Names = c("pattern",
"category"), row.names = c(NA, -7L), class = "data.frame")
I want to create an expanded data.frame that takes df1 and expands it with df2, so that it looks like this:
                   pattern cat1 cat2
1             I want pizza    a    b
2           I want hot dog    a    b
3             I want chips    a    b
4           I'm very happy    c    d
5      I'm very much happy    c    d
6      I'm very optimistic    c    d
7 I'm very much optimistic    c    d
output <- structure(list(pattern = c("I want pizza", "I want hot dog", "I want chips",
"I'm very happy", "I'm very much happy", "I'm very optimistic",
"I'm very much optimistic"), cat1 = c("a", "a", "a", "c", "c",
"c", "c"), cat2 = c("b", "b", "b", "d", "d", "d", "d")), .Names = c("pattern",
"cat1", "cat2"), row.names = c(NA, -7L), class = "data.frame")
Here is what I would do:
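library(stringi)
library(data.table)
setDT(df1)
setDT(df2)

# Matches a bracketed category such as [food] and captures its name
capture_patt = "\\[(\\w+)\\]"

df1[, {
  # Category names found in this row's pattern
  cats = stri_match_all(pattern, regex = capture_patt)[[1]][, 2]
  # Turn each [category] into a %s placeholder for sprintf
  new_patt = gsub(capture_patt, "%s", pattern)
  # Cross join the dictionary entries of every category
  subs = do.call(CJ, lapply(cats, function(cat)
    df2[.(category = cat), on="category", pattern]
  ))
  .(res = do.call(sprintf, c(.(fmt = new_patt), subs)))
}, by=names(df1)]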
#                   pattern cat1 cat2                       res
# 1:          I want [food]    a    b              I want chips
# 2:          I want [food]    a    b            I want hot dog
# 3:          I want [food]    a    b              I want pizza
# 4: I'm [amplifier] [pos].    a    b           I'm very happy.
# 5: I'm [amplifier] [pos].    a    b      I'm very optimistic.
# 6: I'm [amplifier] [pos].    a    b      I'm very much happy.
# 7: I'm [amplifier] [pos].    a    b I'm very much optimistic.
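The result above keeps the original template in pattern and puts the expanded text in res, while the question asks for three columns with the expanded text as pattern. One way to get there is sketched below. It assumes the result has been stored in a variable, here hypothetically called expanded, and the small table is only a stand-in for the full output.

```r
library(data.table)

# Stand-in for the four-column result (hypothetical name, illustrative rows)
expanded <- data.table(
  pattern = c("I want [food]", "I want [food]"),
  cat1    = c("a", "a"),
  cat2    = c("b", "b"),
  res     = c("I want chips", "I want pizza")
)

# Keep the expanded text under the name `pattern` and drop the template
output <- expanded[, .(pattern = res, cat1, cat2)]

# Optional: convert back to a plain data.frame by reference
setDF(output)
```

Selecting with .() builds a new table, so expanded itself is left unchanged.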
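How it works.
The objects are:
cats: the categories we need to look up
new_patt: the sprintf-ready version of the pattern
subs: the table of substitutions that have to be made
res: the new column
The trickier functions are:
CJ takes the Cartesian product (a cross join), like expand.grid in MrFlick's answer.
do.call(f, list_o_args) passes a list of arguments to a function.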
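To see these pieces in isolation, here is a small walk-through for a single pattern. It is only a sketch: the trimmed-down df2 and the names patt and word_lists are illustrative, and stri_match_all_regex is used as the explicit-regex form of the stri_match_all call.

```r
library(data.table)
library(stringi)

# Trimmed-down dictionary in the same shape as df2 (illustrative subset)
df2 <- data.table(
  pattern  = c("very", "very much", "happy", "optimistic"),
  category = c("amplifier", "amplifier", "pos", "pos")
)

patt         <- "I'm [amplifier] [pos]"
capture_patt <- "\\[(\\w+)\\]"

# cats: column 2 of the match matrix holds the captured category names
cats <- stri_match_all_regex(patt, capture_patt)[[1]][, 2]

# new_patt: every [category] becomes a %s placeholder
new_patt <- gsub(capture_patt, "%s", patt)

# One character vector of dictionary entries per category
word_lists <- lapply(cats, function(ct) df2[category == ct, pattern])

# subs: do.call spreads the list over CJ's arguments, and CJ returns
# every combination (sorted) as a data.table
subs <- do.call(CJ, word_lists)

# res: sprintf is vectorised, so each row of subs fills the template once
res <- do.call(sprintf, c(list(new_patt), unname(as.list(subs))))
print(res)
```

Because do.call builds the call from a list, the same code handles a pattern with one bracketed category or several without any special casing.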