用于多个类别的ifelse样式重新编码的习语

Ari*_*man 17 r recode r-factor

我经常遇到这种情况,我认为必须有一个很好的成语.假设我有一个包含一系列属性的data.frame,包括"product".我还有一把钥匙,可以将产品转化为品牌+尺寸.产品代码1-3是Tylenol,4-6是Advil,7-9是拜耳,10-12是Generic.

什么是最快的(就人类时间而言)编码方式?

ifelse如果有3个或更少的类别,我倾向于使用嵌套的;如果有超过3个类型,则键入数据表并将其合并.任何更好的想法?Stata有一个非常漂亮的recode命令,虽然我相信它会促进数据代码混合有点过分.

dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L, 
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, 
-20L), class = "data.frame")
Run Code Online (Sandbox Code Playgroud)

Mar*_*rek 19

您可以将变量转换为因子并按levels<-功能更改其级别.在一个命令中它可能像:

`levels<-`(
    factor(dat$product),
    list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)
Run Code Online (Sandbox Code Playgroud)

步骤:

brands <- factor(dat$product)
levels(brands) <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
Run Code Online (Sandbox Code Playgroud)

  • 好快捷!我在这里找到了解释:[link](http://stackoverflow.com/q/10449366/1460352) (2认同)

huo*_*uon 14

可以使用列表作为关联数组来定义brand -> product code映射,即:

brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
Run Code Online (Sandbox Code Playgroud)

一旦你有了这个,你可以反过来创建一个product code -> brand列表(可能会占用大量内存),或者只是使用搜索功能:

find.key <- function(x, li, default=NA) {
    ret <- rep.int(default, length(x))
    for (key in names(li)) {
        ret[x %in% li[[key]]] <- key
    }
    return(ret)
}
Run Code Online (Sandbox Code Playgroud)

我确信有更好的方法来编写这个函数(for循环让我烦恼!),但至少它是矢量化的,所以它只需要一次通过列表.

使用它将是这样的:

> dat$brand <- find.key(dat$product, brands)
> dat
   product   brand
1       11 Generic
2       11 Generic
3        9   Bayer
4        9   Bayer
5        6   Advil
6        1 Tylenol
7       11 Generic
8        5   Advil
9        7   Bayer
10      11 Generic
11       5   Advil
12      11 Generic
13       4   Advil
14       3 Tylenol
15      10 Generic
16       7   Bayer
17      10 Generic
18       5   Advil
19       9   Bayer
20       8   Bayer
Run Code Online (Sandbox Code Playgroud)

recodelevels<-解决方案是非常好的,但他们也不止这一个显著慢(一旦你有find.key这种情况很容易换比人类recode和与票面levels<-):

> microbenchmark(
     recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"), 
     find.key=find.key(dat$product, brands),
     levels=`levels<-`(factor(dat$product),brands))
Unit: microseconds
      expr      min        lq    median        uq      max
1 find.key   64.325   69.9815   76.8950   83.8445  221.748
2   levels  240.535  248.1470  274.7565  306.8490 1477.707
3   recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938
Run Code Online (Sandbox Code Playgroud)

(我无法让switch版本正确地进行基准测试,但它似乎比上述所有版本更快,尽管对于人类而言比recode解决方案更糟糕.)


Ben*_*nes 12

我喜欢包中的recode功能car:

library(car)

dat$brand <- recode(dat$product,
  recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'")

# > dat
#    product   brand
# 1       11 Generic
# 2       11 Generic
# 3        9   Bayer
# 4        9   Bayer
# 5        6   Advil
# 6        1 Tylenol
# 7       11 Generic
# 8        5   Advil
# 9        7   Bayer
# 10      11 Generic
# 11       5   Advil
# 12      11 Generic
# 13       4   Advil
# 14       3 Tylenol
# 15      10 Generic
# 16       7   Bayer
# 17      10 Generic
# 18       5   Advil
# 19       9   Bayer
# 20       8   Bayer
Run Code Online (Sandbox Code Playgroud)

  • `recode`的唯一问题是它通过处理字符串来工作,所以如果你的代码/数据碰巧有分号和=符号,那就很麻烦...... (8认同)

koh*_*ske 8

我经常使用以下技术:

key <- c()
key[1:3] <- "Tylenol"
key[4:6] <- "Advil"
key[7:9] <- "Bayer"
key[10:12] <- "Generic"
Run Code Online (Sandbox Code Playgroud)

然后,

> key[dat$product]
 [1] "Generic" "Generic" "Bayer"   "Bayer"   "Advil"   "Tylenol" "Generic" "Advil"   "Bayer"   "Generic"
[11] "Advil"   "Generic" "Advil"   "Tylenol" "Generic" "Bayer"   "Generic" "Advil"   "Bayer"   "Bayer"  
Run Code Online (Sandbox Code Playgroud)


flo*_*del 7

"数据库方法"是为产品密钥定义保留单独的表(data.frame).它更有意义,因为你说你的产品键不仅转化为品牌,还转化为大小:

product.keys <- read.table(textConnection("

product brand   size
1       Tylenol small
2       Tylenol medium
3       Tylenol large
4       Advil   small
5       Advil   medium
6       Advil   large
7       Bayer   small
8       Bayer   medium
9       Bayer   large
10      Generic small
11      Generic medium
12      Generic large

"), header = TRUE)
Run Code Online (Sandbox Code Playgroud)

然后,您可以使用merge以下方式加入数据:

merge(dat, product.keys, by = "product")
#    product   brand   size
# 1        1 Tylenol  small
# 2        3 Tylenol  large
# 3        4   Advil  small
# 4        5   Advil medium
# 5        5   Advil medium
# 6        5   Advil medium
# 7        6   Advil  large
# 8        7   Bayer  small
# 9        7   Bayer  small
# 10       8   Bayer medium
# 11       9   Bayer  large
# 12       9   Bayer  large
# 13       9   Bayer  large
# 14      10 Generic  small
# 15      10 Generic  small
# 16      11 Generic medium
# 17      11 Generic medium
# 18      11 Generic medium
# 19      11 Generic medium
# 20      11 Generic medium
Run Code Online (Sandbox Code Playgroud)

如您所知,行的顺序不会被保留merge.如果这是一个问题,该plyr包有一个join确保订单的功能:

library(plyr)
join(dat, product.keys, by = "product")
#    product   brand   size
# 1       11 Generic medium
# 2       11 Generic medium
# 3        9   Bayer  large
# 4        9   Bayer  large
# 5        6   Advil  large
# 6        1 Tylenol  small
# 7       11 Generic medium
# 8        5   Advil medium
# 9        7   Bayer  small
# 10      11 Generic medium
# 11       5   Advil medium
# 12      11 Generic medium
# 13       4   Advil  small
# 14       3 Tylenol  large
# 15      10 Generic  small
# 16       7   Bayer  small
# 17      10 Generic  small
# 18       5   Advil medium
# 19       9   Bayer  large
# 20       8   Bayer medium
Run Code Online (Sandbox Code Playgroud)

最后,如果您的表很大并且速度是个问题,请考虑使用data.tables(来自data.table包)而不是data.frames.


Tyl*_*ker 6

这个需要一些打字,但如果你真的有一个庞大的数据集,这可能是要走的路.在talkstats.com上的Bryangoodrich和Dason教会了我这个.它使用哈希表或创建包含查找表的环境.我实际上将这个保留在我的.Rprofile(哈希函数)上,用于字典类型查找.

我将您的数据复制1000次以使其更大一些.

#################################################
# THE HASH FUNCTION (CREATES A ENW ENVIRONMENT) #
#################################################
hash <- function(x, type = "character") {
    e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
    char <- function(col) assign(col[1], as.character(col[2]), envir = e)
    num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
    FUN <- if(type=="character") char else num
    apply(x, 1, FUN)
    return(e)
}
###################################
# YOUR DATA REPLICATED 1000 TIMES #
###################################
dat <- dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L, 
    7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA, 
    -20L), class = "data.frame")
dat <- dat[rep(seq_len(nrow(dat)), 1000), , drop=FALSE]
rownames(dat) <-NULL
dat
#########################
# CREATE A LOOKUP TABLE #
#########################
med.lookup <- data.frame(val=as.character(1:12), 
    med=rep(c('Tylenol', 'Advil', 'Bayer', 'Generic'), each=3))  

########################################
# USE hash TO CREATE A ENW ENVIRONMENT #
########################################  
meds <- hash(med.lookup)  

##############################
# CREATE A RECODING FUNCTION #
##############################          
recoder <- function(x){
    x <- as.character(x) #turn the numbers to character
    rc <- function(x){
       if(exists(x, env = meds))get(x, e = meds) else NA 
    }  
    sapply(x, rc, USE.NAMES = FALSE) 
}
#############
# HASH AWAY #
#############
recoder(dat[, 1])    
Run Code Online (Sandbox Code Playgroud)

在这种情况下,散列很慢但是如果你有更多的级别来重新编码,那么它将比其他级别更快.