rbindlist two data.tables where one has a factor and the other has a character type for a column

Aru*_*run 13 r data.table

I just came across this warning in my script, and it struck me as a bit strange.

# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

Observation 1: Here is a reproducible example:

require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# works fine
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

However, if I now convert column x to a factor (ordered or not) and repeat the same operation:

DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#      x  y
#  1:  a  6
#  2:  b  7
#  3:  c  8
#  4:  d  9
#  5:  e 10
#  6: NA 11
#  7: NA 12
#  8: NA 13
#  9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

But rbind handles this case just fine!

rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

The same behaviour can be reproduced when column x is an ordered factor. Since the help page ?rbindlist says: Same as do.call("rbind",l), but much faster., I am guessing this is not the desired behaviour?


Here is my session info:

# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.8.8
# 
# loaded via a namespace (and not attached):
# [1] tools_3.0.0

Edit:

Observation 2: Following up on another interesting observation from @AnandaMahto, reversing the order:

# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
#     x  y
#  1: A 11
#  2: B 12
#  3: C 13
#  4: D 14
#  5: E 15
#  6: 1  6
#  7: 2  7
#  8: 3  8
#  9: 4  9
# 10: 5 10

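Here, the column x from DT.1 is silently coerced to numeric.
Edit: Just to clarify, this is the same behaviour as rbind(DT.2, DT.1) when column x of DT.1 is a factor. rbind seems to retain the class of the first argument. I will leave this part here and note that, in this case, this is the desired behaviour, since rbindlist is a faster implementation of rbind.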
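One plausible reading of the 1 to 5 values in Observation 2 is that the factor's internal integer codes, rather than its labels, end up in the character column. The base R sketch below only illustrates that idea; the vector f is a hypothetical stand-in for column x of DT.1, and nothing here is taken from the data.table source.

```r
# Hypothetical stand-in for column x of DT.1 after factor(x)
f <- factor(letters[1:5])

codes <- as.integer(f)    # internal integer codes of the factor
labs  <- as.character(f)  # the level labels the codes point to

# Appending the raw codes to a character vector coerces them to text
from_codes <- c(LETTERS[1:5], codes)

# Appending the labels keeps the original values
from_labels <- c(LETTERS[1:5], labs)

print(from_codes)
print(from_labels)
```

If that reading is right, converting the factor with as.character() before binding would avoid the surprise.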

Observation 3: If both columns are now converted to factors:

# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: a 11
#  7: b 12
#  8: c 13
#  9: d 14
# 10: e 15

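Here, column x from DT.2 is lost (replaced with that of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 is replaced with that of DT.2).

In general, there seems to be a problem with how rbindlist handles factor columns.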

Ric*_*rta 7

UPDATE: This bug (#2650) was fixed on 17 May 2013 in v1.8.9.


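I believe that rbindlist, when applied to factors, combines the numerical values of the factors and uses only the levels associated with the first list element.

As described in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975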
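To make that belief concrete, here is a minimal base R sketch that mimics the suspected behaviour: concatenate the integer codes of two factors and interpret them with the first factor's levels only. The names f1, f2 and combined are made up for illustration, and this is an emulation of the symptom, not the actual rbindlist code.

```r
# Two factors with the same number of levels but different labels
f1 <- factor(letters[1:5])  # levels a..e
f2 <- factor(LETTERS[1:5])  # levels A..E

# Concatenate the integer codes from both factors
codes <- c(as.integer(f1), as.integer(f2))

# Interpret every code using only the levels of the first factor
combined <- factor(levels(f1)[codes], levels = levels(f1))

print(combined)
```

The second half of combined repeats a to e, which matches the output shown in Observation 3.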


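# Temporary workaround: give both columns the same set of levels

# unique() guards against duplicated values, which factor() rejects as levels
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))

DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]

rbindlist(list(DT.1, DT.2))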
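Why the workaround helps can be sketched in base R, again assuming the integer codes are simply concatenated: once both factors share one level set, every code refers to the same label table, so nothing is lost. The vectors a and b are hypothetical stand-ins for the two x columns.

```r
# Hypothetical stand-ins for the x columns of DT.1 and DT.2
a <- letters[1:5]
b <- LETTERS[1:5]

# One shared level set for both factors
levs <- unique(c(a, b))
f1 <- factor(a, levels = levs)
f2 <- factor(b, levels = levs)

# Concatenating the codes is now safe, since they index the same levels
codes <- c(as.integer(f1), as.integer(f2))
combined <- factor(levs[codes], levels = levs)

print(combined)
```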

Here is another way to look at it:

DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)

DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]

DT3
DT4

# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd

do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd

Edit as per the comments:

As for Observation 1, what is happening is similar to:

x <- factor(LETTERS[1:5])

x[6:10] <- letters[1:5]
x

# Notice however, if you are assigning a value that is already present
x[11] <- "S"  # warning, since `S` is not one of the levels of x
x[12] <- "D"  # all good, since `D` *is* one of the levels of x
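For comparison, a small base R sketch of the same assignment with the new values registered as levels first. It is only meant to show that the NAs come from the missing levels, not from the assignment itself.

```r
x <- factor(LETTERS[1:5])

# Add the new values to the level set before assigning them
levels(x) <- c(levels(x), letters[1:5])

# No warning and no NAs now, since a..e are valid levels of x
x[6:10] <- letters[1:5]

print(x)
```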