Ale*_*lex 5 sorting r data.table
When I rbind two data.tables that contain an ordered factor, the ordering seems to be lost:
library(data.table)
dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered = TRUE), key = "id")
dtb2 = data.table(id = factor(c("c"), levels = c("a", "c", "b"), ordered = TRUE), key = "id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE
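Any thoughts or ideas?
data.table does some fancy footwork, which means that data.table:::.rbind.data.table is called when rbind is called on objects that include data.tables. .rbind.data.table uses the speedups associated with rbindlist, with some extra checking to match names, etc.
.rbind.data.table handles factor columns by combining them with c (and therefore retains the levels attribute).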
# the relevant code is
l = lapply(seq_along(allargs[[1L]]), function(i) do.call("c",
lapply(allargs, "[[", i)))
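
To make the excerpt above concrete, here is a minimal base R sketch of the same column-wise pattern, with plain lists standing in for data.tables. The contents of `allargs` here are made up for illustration and are not taken from the data.table source.

```r
# Two "tables" represented as plain lists of columns (illustrative data only).
allargs <- list(
  list(id = 1:2, x = c("a", "b")),
  list(id = 3L,  x = "c")
)

# For each column position i, pull column i out of every table
# and combine the pieces with c().
l <- lapply(seq_along(allargs[[1L]]), function(i) {
  do.call("c", lapply(allargs, "[[", i))
})

str(l)
```

Because every column is combined through c(), whatever c() does to a factor (or which c method is dispatched) decides what the combined column looks like.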
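In base R (at the time of writing; versions before 4.1.0, which added a c() method for factors), using c in this way does not preserve the "ordered" attribute; it doesn't even return a factor!
For example (in base R):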
f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise, as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE
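However, data.table has an S3 method c.factor, which ensures that a factor is returned and that the levels are retained. Unfortunately, this method does not preserve the ordered attribute.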
getAnywhere('c.factor')
# A single object matching ‘c.factor’ was found
# It was found in the following places
# namespace:data.table
# with value
#
# function (...)
# {
# args <- list(...)
# for (i in seq_along(args)) if (!is.factor(args[[i]]))
# args[[i]] = as.factor(args[[i]])
# newlevels = unique(unlist(lapply(args, levels), recursive = TRUE,
# use.names = TRUE))
# ind <- fastorder(list(newlevels))
# newlevels <- newlevels[ind]
# nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
# ans = unlist(lapply(args, function(x) {
# m = match(levels(x), newlevels)
# m[as.integer(x)]
# }))
# structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>
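
For comparison, here is a minimal base R sketch of a combine function that does keep the ordering. `combine_factors` is a hypothetical helper written for this illustration; it is not part of data.table or base R, and it assumes the result should be ordered only when every input is an ordered factor with identical levels.

```r
# Hypothetical helper: combine factors, keeping the "ordered" class
# when every input is ordered and all inputs share the same levels.
combine_factors <- function(...) {
  # Coerce non-factor inputs to factors.
  args <- lapply(list(...), function(x) if (is.factor(x)) x else as.factor(x))
  all_levels <- lapply(args, levels)
  same_levels <- length(unique(all_levels)) == 1L
  keep_order <- same_levels && all(vapply(args, is.ordered, logical(1)))
  # With identical levels, keep them as given; otherwise take the union
  # in order of first appearance.
  new_levels <- if (same_levels) all_levels[[1L]] else unique(unlist(all_levels))
  values <- unlist(lapply(args, as.character), use.names = FALSE)
  factor(values, levels = new_levels, ordered = keep_order)
}

f <- factor(c("a", "b"), levels = c("a", "c", "b"), ordered = TRUE)
g <- factor("c", levels = c("a", "c", "b"), ordered = TRUE)
combined <- combine_factors(f, g)
is.ordered(combined)
levels(combined)
```

Unlike the method shown above, this sketch does not sort the levels, since re-sorting would discard the custom order a < c < b.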
So yes, this is a bug. It has now been reported as #5019.
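
Until the bug is fixed, one possible workaround is to rebuild the ordered factor after binding. The sketch below is base R only; `restore_ordered` is a hypothetical helper name, and the plain factor `lost` is just a stand-in for a column such as test$id whose ordering was dropped.

```r
# Hypothetical helper: rebuild an ordered factor from a column whose
# "ordered" class was dropped, using a known level order.
restore_ordered <- function(x, levels) {
  factor(as.character(x), levels = levels, ordered = TRUE)
}

lv <- c("a", "c", "b")
# Stand-in for the combined column: a plain, unordered factor.
lost <- factor(c("a", "b", "c"))
fixed <- restore_ordered(lost, lv)
is.ordered(fixed)
# Comparisons follow a < c < b again.
fixed < "b"
```

On the data.table from the question, the same helper could presumably be applied with something like test[, id := restore_ordered(id, lv)].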