In R, use dplyr's mutate() to create a new variable conditional on the contents of another

M. *_*ott 3 r

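I want to search the contents of a variable, placement, and create a new variable, term, based on the pattern found. A minimal example...

First, I create a function that searches for the pattern: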

calcterm <- function(x){    # calcterm takes a column argument to read
    print(x)
    if (x %in% '_fa_') {
            return ('fall')
    } else if (x %in% '_wi_') {
            return('winter')
    } else if (x %in% '_sp_') {
            return('spring')
    } else {return('summer')
    }
}

I'll create a small data frame, which I then pass to dplyr's tbl_df:

placement <- c('pn_ds_ms_fa_th_hrs','pn_ds_ms_wi_th_hrs' ,'pn_ds_ms_wi_th_hrs')
hours <- c(1230, NA, 34)

d <- data.frame(placement, hours)

library(dplyr)

d <- tbl_df(d)

The table d now displays as:

>d
    Source: local data frame [3 x 2]

       placement hours
          (fctr) (dbl)
1 pn_ds_ms_fa_th_hrs  1230
2 pn_ds_ms_wi_th_hrs    NA
3 pn_ds_ms_wi_th_hrs    34

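Next, I use mutate to apply my function. The goal is to read the contents of placement and create a new variable whose value is fall, winter, spring, or summer, depending on the pattern found in the placement column.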

d %>% mutate(term=calcterm(placement))

The output disappoints me:

[1] pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs pn_ds_ms_wi_th_hrs
Levels: pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs
Source: local data frame [3 x 3]

       placement hours   term
          (fctr) (dbl)  (chr)
1 pn_ds_ms_fa_th_hrs  1230 summer
2 pn_ds_ms_wi_th_hrs    NA summer
3 pn_ds_ms_wi_th_hrs    34 summer

Warning messages:
    1: In if (x %in% "_fa_") { :
      the condition has length > 1 and only the first element will be used
    2: In if (x %in% "_wi_") { :
      the condition has length > 1 and only the first element will be used
    3: In if (x %in% "_sp_") { :
      the condition has length > 1 and only the first element will be used

So, clearly I got something wrong from the start... Perhaps %in% could be replaced with a grep pattern? I don't know how to proceed.

Thanks.
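A minimal base R sketch of why every row falls through to summer, assuming only the three placement values shown above: %in% tests whether a whole value equals "_fa_", whereas grepl() looks for a substring and returns one result per element. The names exact and partial are illustrative only.

```r
placement <- c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs", "pn_ds_ms_wi_th_hrs")

# %in% compares whole values, so a substring such as "_fa_" never matches
exact <- placement %in% "_fa_"

# grepl() with fixed = TRUE tests for a literal substring, element by element
partial <- grepl("_fa_", placement, fixed = TRUE)

print(exact)
print(partial)
```

Because if() expects a single TRUE or FALSE, a vector-valued test like this needs a vectorised construct (for example ifelse() or logical indexing) rather than if/else.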

UPDATE

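Based on the replies below, I'm updating this with my full pipeline to show how I implemented it. The data I'm working with is "wide"; I first just flip its axes and extract the useful information from the combined names. This example works, but on my own data, when I reach the mutate() step, I get the message: Error: invalid subscript type 'list'

It's worth noting that after summarise() I get a warning: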

Warning message:
attributes are not identical across measure variables; they will be dropped  

Perhaps this is related to the failure at the next step, since the warning doesn't appear in my example?

set.seed(1) 

dfmaker <- function() {
        setNames(
                data.frame(
                        replicate(5, sample(c(NA, 300:500), 4, TRUE), FALSE)), 
                c('pn_ds_ms_fa_th_hrs','rn_ds_ms_wi_th_stu' ,'adn_ds_ms_wi_th_hrs','pn_ds_ms_wi_th_hrs' ,'rn_bsn_ds_ms_wi_th_hrs'))
}


d <- dfmaker()

library(dplyr)
library(tidyr)   # gather() used in the pipeline below comes from tidyr

d <- tbl_df(d)

grepl_vec_pattern = Vectorize(grepl, 'pattern')

calcterm = function(s) {
        require(pryr)
        s = as.character(s)
        grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi'))
        stopifnot(all(rowSums(grepped_patterns) == 1))   # Ensure that every row has exactly one match
        reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which))
        lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter')
        lut_table[reduce_to_colname_with_true]
}

select(d, matches("^pn_|^adn_|^bsn_"), -starts_with("rn_bsn")) %>%  # all the pn, adn, bsn programs, for all information 
        select(contains("_hrs") ) %>%   # takes out just the hours
        gather(placement, hours) %>%  # flip it!
        group_by(placement) %>%  # gather all the schools into a single observation (replicated placement values at this point)
        summarise(sumHours = sum(hours, na.rm=T)) %>%
        mutate(term = calcterm(placement))
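One plausible source of the invalid subscript type 'list' error, offered as a guess rather than a diagnosis of the real data: if any row matches zero or several patterns, apply() returns a list instead of a character vector, and indexing the lookup vector with a list fails. The sketch below uses a made-up two-row match matrix (hits) and a shortened lookup (lut) in base R; both names are hypothetical.

```r
# Shortened, hypothetical lookup vector
lut <- c("_fa" = "fall", "_wi" = "winter")

# Made-up match matrix: row 1 matches "_fa", row 2 matches nothing
hits <- rbind(c("_fa" = TRUE,  "_wi" = FALSE),
              c("_fa" = FALSE, "_wi" = FALSE))

# Rows yield results of different lengths, so apply() cannot simplify to a vector
picked <- apply(hits, 1, function(r) names(which(r)))
print(is.list(picked))

# Indexing a character vector with a list raises the error from the question
res <- tryCatch(lut[picked], error = function(e) conditionMessage(e))
print(res)
```

Checking that every row of the match matrix sums to exactly 1 before indexing would surface this case earlier.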

Dav*_*urg 5

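A simple and very efficient approach could be to create simple lookup/pattern vectors and then combine the (very efficient) stringi::stri_detect_fixed with data.table. This solution should scale very well even for large data sets.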

library(stringi)
library(data.table)
Lookup <- c("fall", "winter", "spring")
Patterns <- c("fa", "wi", "sp")
setDT(d)[, term := Lookup[stri_detect_fixed(placement, Patterns)], by = placement]
d[is.na(term), term := "summer"]
d
#             placement hours   term
# 1: pn_ds_ms_fa_th_hrs  1230   fall
# 2: pn_ds_ms_wi_th_hrs    NA winter
# 3: pn_ds_ms_wi_th_hrs    34 winter
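To make the indexing step explicit, here is a small sketch of what happens for a single placement value. It uses base R's grepl() via vapply() as a stand-in for stri_detect_fixed(one_placement, Patterns), so it is an approximation of the idea rather than the stringi call itself; one_placement, hits, and no_hits are illustrative names.

```r
Lookup   <- c("fall", "winter", "spring")
Patterns <- c("fa", "wi", "sp")

one_placement <- "pn_ds_ms_wi_th_hrs"

# One TRUE/FALSE per pattern for this single string
hits <- vapply(Patterns, grepl, logical(1), x = one_placement, fixed = TRUE)

# Logical indexing keeps the label whose pattern matched
matched <- Lookup[hits]
print(matched)

# With no matching pattern the result has length zero, so a default is still needed
no_hits <- vapply(Patterns, grepl, logical(1), x = "pn_ds_ms_su_th_hrs", fixed = TRUE)
print(length(Lookup[no_hits]))
```

The zero-length result in the last step is the "no match" case that the summer default covers.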

If we insist on dplyr, we need to create a helper function to handle the case where no match is found (something data.table does automatically):

f <- function(x, Lookup, Patterns) {
  temp <- Lookup[stri_detect_fixed(x[1L], Patterns)]
  if(!length(temp)) return("summer")
  temp
}

d %>%
  group_by(placement) %>%
  mutate(term = f(placement, Lookup, Patterns))

# Source: local data frame [3 x 3]
# Groups: placement [2]
# 
#           placement hours   term
#               (fctr) (dbl)  (chr)
# 1 pn_ds_ms_fa_th_hrs  1230   fall
# 2 pn_ds_ms_wi_th_hrs    NA winter
# 3 pn_ds_ms_wi_th_hrs    34 winter
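As an alternative sketch that avoids grouping altogether, a fully vectorised helper can start from the default and overwrite matches by logical indexing. This is not the method used above; calcterm_vec is a hypothetical name, and the underscore-delimited patterns are an assumption taken from the question's original function.

```r
# Hypothetical helper: vectorised over x, falls back to "summer" when nothing matches
calcterm_vec <- function(x) {
  x <- as.character(x)                  # placement may be stored as a factor
  term <- rep("summer", length(x))      # default for unmatched values
  term[grepl("_sp_", x, fixed = TRUE)] <- "spring"
  term[grepl("_wi_", x, fixed = TRUE)] <- "winter"
  term[grepl("_fa_", x, fixed = TRUE)] <- "fall"
  term
}

placement <- factor(c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs", "pn_ds_ms_wi_th_hrs"))
print(calcterm_vec(placement))
```

Since the helper returns one value per input element, it could be called as d %>% mutate(term = calcterm_vec(placement)) without a group_by() step.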

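  • @PaulHiemstra Fair enough, though following the link to my Meta answer in the discussion you mentioned, I don't use packages when they aren't needed. In this case I used "data.table" because of the "by" operation, and not "dplyr", because "mutate" can't handle "no match" and always requires some input. (2 upvotes)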