Dictionary expansion in R

Dmi*_*kin 0 regex dictionary r data.table

I'm looking for a fast and efficient way to expand a dictionary (df1):

                pattern cat1 cat2
1         I want [food]    a    b
2 I'm [amplifier] [pos]    c    d

df1 <- data.frame(pattern=c("I want [food]", "I'm [amplifier] [pos]"),
                      cat1=c("a", "c"), cat2=c("b", "d"), stringsAsFactors=FALSE)

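It contains string patterns in which some categories are enclosed in square brackets []. Each bracketed name refers to a category listed, in dictionary format, in a second data frame (df2).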

     pattern  category
1      pizza      food
2    hot dog      food
3      chips      food
4       very amplifier
5  very much amplifier
6      happy       pos
7 optimistic       pos

df2 <- structure(list(pattern = c("pizza", "hot dog", "chips", "very", 
"very much", "happy", "optimistic"), category = c("food", "food", 
"food", "amplifier", "amplifier", "pos", "pos")), .Names = c("pattern", 
"category"), row.names = c(NA, -7L), class = "data.frame")

I want to create an expanded data.frame that takes df1 and expands it with df2, so that it looks like this:

                   pattern cat1 cat2
1             I want pizza    a    b
2           I want hot dog    a    b
3             I want chips    a    b
4           I'm very happy    c    d
5      I'm very much happy    c    d
6      I'm very optimistic    c    d
7 I'm very much optimistic    c    d

output <- structure(list(pattern = c("I want pizza", "I want hot dog", "I want chips", 
"I'm very happy", "I'm very much happy", "I'm very optimistic", 
"I'm very much optimistic"), cat1 = c("a", "a", "a", "c", "c", 
"c", "c"), cat2 = c("b", "b", "b", "d", "d", "d", "d")), .Names = c("pattern", 
"cat1", "cat2"), row.names = c(NA, -7L), class = "data.frame")

Fra*_*ank 7

Here's what I would do:

library(stringi)
library(data.table)
setDT(df1)
setDT(df2)

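# Matches a bracketed category such as [food] and captures its name
capture_patt = "\\[(\\w+)\\]"
df1[, {
    # Category names found in this row's pattern
    cats = stri_match_all(pattern, regex = capture_patt)[[1]][, 2]
    # Turn each bracketed category into a sprintf placeholder
    new_patt = gsub(capture_patt, "%s", pattern)

    # Cross join of the df2 entries for every category in the pattern
    subs = do.call(CJ, lapply(cats, function(cat) 
      df2[.(category = cat), on="category", pattern]
    ))

    # Fill the placeholders, one result per combination
    .(res = do.call(sprintf, c(.(fmt = new_patt), subs)))
}, by=names(df1)]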


#                  pattern cat1 cat2                      res
# 1:         I want [food]    a    b             I want chips
# 2:         I want [food]    a    b           I want hot dog
# 3:         I want [food]    a    b             I want pizza
# 4: I'm [amplifier] [pos]    c    d           I'm very happy
# 5: I'm [amplifier] [pos]    c    d      I'm very optimistic
# 6: I'm [amplifier] [pos]    c    d      I'm very much happy
# 7: I'm [amplifier] [pos]    c    d I'm very much optimistic

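How it works.

The objects are...

  • cats is the vector of categories we need to grab
  • new_patt is the sprintf-ready version of the pattern
  • subs is the table of substitutions that must be made
  • res is the new column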
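
To make the first two objects concrete, here is a minimal sketch that builds cats and new_patt for a single pattern, outside of data.table. It is only an illustration: it swaps in base R's regmatches() and gregexpr() for stri_match_all(), and the variable patt is a name introduced here, not part of the answer.

```r
# One pattern from df1, used here only for illustration (hypothetical name)
patt <- "I'm [amplifier] [pos]"
capture_patt <- "\\[(\\w+)\\]"

# Base R stand-in for stri_match_all(): find every "[...]" token,
# then strip the brackets to leave the category names
tokens <- regmatches(patt, gregexpr(capture_patt, patt))[[1]]
cats <- gsub("\\[|\\]", "", tokens)

# Replace each bracketed category with a sprintf placeholder
new_patt <- gsub(capture_patt, "%s", patt)

print(cats)      # "amplifier" "pos"
print(new_patt)  # "I'm %s %s"
```

Each entry of cats then selects the matching rows of df2, and new_patt receives one substitution per "%s".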

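The trickier functions are...

  • CJ takes the Cartesian product, like expand.grid in MrFlick's answer.
  • do.call(f, list_o_args) calls the function f with the elements of the list as its arguments.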
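
A small sketch of how these two functions combine, assuming data.table is installed. The vectors amplifier and pos are typed in by hand from df2 for illustration, rather than looked up as in the answer.

```r
library(data.table)

# Substitution vectors copied by hand from df2 (illustrative only)
amplifier <- c("very", "very much")
pos <- c("happy", "optimistic")

# CJ() builds every combination of its arguments (a cross join),
# sorted, with the last column varying fastest
subs <- CJ(amplifier, pos)

# do.call() spreads the list over sprintf's arguments, so this is
# equivalent to sprintf("I'm %s %s", subs[[1]], subs[[2]])
res <- do.call(sprintf, c(list(fmt = "I'm %s %s"), subs))
print(res)
```

Because sprintf is vectorised over its arguments, one call fills the placeholders for every row of subs at once.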