Warning message: In rbindlist(allargs): NAs introduced by coercion: possible bug in data.table?

Aru*_*run 9 r data.table

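While analysing some data, I ran into a warning message that I suspect is a bug, because it comes from a very simple command that I have used many times before.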

Warning message:
In rbindlist(allargs) : NAs introduced by coercion

I was able to reproduce it. Here is some code with which you should be able to reproduce it as well.

library(data.table)

# unique random names for column V1
set.seed(45)
n <- sapply(1:500, function(x) {
    paste(sample(c(letters[1:26]), 10), collapse="")
})
# generate some values for V2 and V3
dt <- data.table(V1 = sample(n, 30*500, replace = TRUE), 
                 V2 = sample(1:10, 30*500, replace = TRUE), 
                 V3 = sample(50:100, 30*500, replace = TRUE))
setkey(dt, "V1")

# No warning when providing column names (and right results)
dt[, list(s = sum(V2), m = mean(V3)),by=V1]

#              V1   s        m
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  93 77.95000
# 497: zwpserimgf 178 72.62963
# 498: zxkpdrlcsf 154 78.04167
# 499: zxvoaeflhq 121 75.34615
# 500: zyiwcsanlm 180 76.61290

# Warning message and results with NA
dt[, list(sum(V2), mean(V3)),by=V1]

#              V1  V1       V2
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  NA 77.95000
# 497: zwpserimgf  NA 72.62963
# 498: zxkpdrlcsf  NA 78.04167
# 499: zxvoaeflhq  NA 75.34615
# 500: zyiwcsanlm  NA 76.61290

Warning message:
In rbindlist(allargs) : NAs introduced by coercion
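  • 1) This seems to happen only if you do not provide column names.

  • 2) Even then, it seems to happen specifically when V1 (or whichever column you use in by=) has many unique entries (500 here) and you have not specified column names. That is, it does not happen when the by= column V1 has fewer unique entries. For example, try changing the code for n from sapply(1:500, ... to sapply(1:50, ... and you will get no warning.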

What is going on here? This is R version 2.15 on a MacBook Pro running OS X 10.8.2 (I have also tested it on another MacBook Pro with 2.15.2). Here is the sessionInfo().

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6 reshape2_1.2.2  

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2 tools_2.15.0 

Just reproduced on 2.15.2:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6

Mat*_*wle 7

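Update: now fixed in v1.8.9 by Ricardo

o rbind'ing data.tables containing duplicate, "" or NA column names now works, #2726. Thanks to Garrett See and Arun Srinivasan for reporting. This also affected the printing of data.tables with duplicate column names, since the head and tail are rbind-ed together internally.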


Yes, it is a bug. It seems to be in the print method for data.tables that have duplicate column names.

ans = dt[, list(sum(V2), mean(V3)),by=V1]
head(ans)
           V1  V1       V2     # notice the duplicated V1
1: acgmqyuwpe 140 78.07692
2: adcltygwsq 191 76.93333
3: adftozibnh 153 73.82143
4: aeuowtlskr 122 73.04348
5: ahfoqclkpg 143 75.83333
6: ahtczyuipw 135 73.54167
tail(ans)
           V1  V1       V2
1: zugrnehpmq 189 72.63889
2: zuqegoxkpi 150 76.03333
3: zwpserimgf 180 74.81818
4: zxkpdrlcsf 115 72.57895
5: zxvoaeflhq 157 76.53571
6: zyiwcsanlm 145 72.79167
print(ans)
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion
rbind(head(ans),tail(ans))
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion

As a workaround, don't create a data.table with column names V1, V2, etc.
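A minimal sketch of that workaround, assuming a current data.table installation. The column names id, count and score and the helper dedupe_names() are made up for illustration and are not part of data.table. The idea is either to keep the input columns away from the auto-generated V1, V2, ... names, or to name the results in j yourself, or to repair a result that already has clashing names.

```r
library(data.table)

set.seed(45)
# Input columns deliberately NOT called V1, V2, ... (illustrative names)
dt <- data.table(id    = sample(letters, 3000, replace = TRUE),
                 count = sample(1:10,   3000, replace = TRUE),
                 score = sample(50:100, 3000, replace = TRUE))

# Option 1: name the results in j, so no auto-generated names are involved
named <- dt[, list(s = sum(count), m = mean(score)), by = id]

# Option 2: leave j unnamed; the auto-generated names cannot clash with `id`
auto <- dt[, list(sum(count), mean(score)), by = id]

# Option 3: repair an existing result whose names clash (hypothetical helper).
# make.unique() is base R; setnames() renames by reference.
dedupe_names <- function(x) {
  if (anyDuplicated(names(x))) setnames(x, make.unique(names(x)))
  invisible(x)
}

# Same pattern as in the question: grouping column is called V1
bad <- data.table(V1 = c("a", "a", "b"), V2 = 1:3)[, list(sum(V2)), by = V1]
dedupe_names(bad)
```

Option 1 is probably the safest habit, since the output names then no longer depend on what the by= column happens to be called.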

It arises from this known bug:

#2384 rbind of tables containing duplicate column names doesn't bind correctly

I've added a link to this question there.

Thanks!

  • @Arun Because only when `dt` has more than 100 rows (by default) does `print(dt)` `rbind` the `head` and `tail` together in order to print the top and the bottom. (3 upvotes)
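The row threshold in that comment can be inspected with a short sketch like the one below. It assumes a current data.table, where the limits are exposed as the options datatable.print.nrows and datatable.print.topn; the tables small and large are made up for illustration, and the hand-built preview only approximates what the print method does internally.

```r
library(data.table)

small <- data.table(id = 1:50,  x = 50:1)    # below the default threshold
large <- data.table(id = 1:500, x = 500:1)   # above the default threshold

limit <- getOption("datatable.print.nrows")  # row count above which print abbreviates
topn  <- getOption("datatable.print.topn")   # rows shown from the top and from the bottom

nrow(small) > limit   # printed in full when this is FALSE
nrow(large) > limit   # head and tail are combined when this is TRUE

print(small)
print(large)

# Roughly what the abbreviated display consists of, assembled by hand
preview <- rbind(head(large, topn), tail(large, topn))
```

This is consistent with the question: with 500 groups the result exceeds the threshold and goes through the rbind step, while with 50 groups it is printed in full and never reaches it.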