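Using data.table in R, I am trying to perform an operation on the subset of the data that excludes the selected element. I am using the by argument, but I don't know whether this is the right approach.
Here is an example. The value for Delta in IAH:SNA is (3 + 3) / 2, which is the mean of Stops in IAH:SNA once Delta has been excluded.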
library(data.table)
s1 <- "Market Carrier Stops
IAH:SNA Delta 1
IAH:SNA Delta 1
IAH:SNA Southwest 3
IAH:SNA Southwest 3
MSP:CLE Southwest 2
MSP:CLE Southwest 2
MSP:CLE American 2
MSP:CLE JetBlue 1"
d <- data.table(read.table(textConnection(s1), header=TRUE))
setkey(d, Carrier, Market)
f <- function(x, y){
subset(d, !(Carrier %in% x) & Market == y, Stops)[, mean(Stops)]}
d[, s := f(.BY[[1]], .BY[[2]]), by=list(Carrier, Market)]
## Market Carrier Stops s
## 1: MSP:CLE American 2 1.666667
## 2: IAH:SNA Delta 1 3.000000
## 3: IAH:SNA Delta 1 3.000000
## 4: MSP:CLE JetBlue 1 2.000000
## 5: IAH:SNA Southwest 3 1.000000
## 6: IAH:SNA Southwest 3 1.000000
## 7: MSP:CLE Southwest 2 1.500000
## 8: MSP:CLE Southwest 2 1.500000
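The solution above performs very poorly on large datasets (it is essentially an mapply over the groups, re-subsetting the whole table each time), but I am not sure how to do this in a fast, data.table way.
Perhaps one could (dynamically) generate a factor that does this? I'm just not sure how...
Is there a way to improve it?
Edit: just for the sake of it, here is a way to get a bigger version of the data above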
library(data.table)
dl.dta <- function(...){
## input years ..
years <- gsub("\\.", "_", c(...))
baseurl <- "http://www.transtats.bts.gov/Download/"
names <- paste("Origin_and_Destination_Survey_DB1BMarket", years, sep="_")
info <- t(sapply(names, function(x) file.exists(paste(x, c("zip", "csv"), sep="."))))
to.download <- paste(baseurl, names, ".zip", sep="")[!apply(info, 1, any)]
if (length(to.download) > 0){
message("starting download...")
sapply(to.download,
function(x) download.file(x, rev(strsplit(x, "/")[[1]])[1]))}
to.unzip <- paste(names, "zip", sep=".")[!info[, 2]]
if (length(to.unzip) > 0){
message("starting to unzip...")
sapply(to.unzip, unzip)}
paste(names, "csv", sep=".")}
countWords.split <- function(x, s=":"){
## Faster on my machine than grep for some reason
sapply(strsplit(as.character(x), s), length)}
countWords.grep <- function(x){
sapply(gregexpr("\\W+", x), length)+1}
fname <- dl.dta(2013.1)
cols <- rep("NULL", 41)
## Columns to keep: 9 is Origin, 18 is Dest, 24 is groups of airports in travel
## 30 is RPCarrier (reporting carrier).
## For more columns: 35 is market fare and 36 is distance.
cols[9] <- cols[18] <- cols[24] <- cols[30] <- NA
d <- data.table(read.csv(file=fname, colClasses=cols))
d[, Market := paste(Origin, Dest, sep=":")]
## should probably
d[, Stops := -2 + countWords.split(AirportGroup)]
d[, Carrier := RPCarrier]
d[, c("RPCarrier", "Origin", "Dest", "AirportGroup") := NULL]
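@Roland's answer works for some functions (and when it applies, it will be the best option), but not for general ones. Unfortunately, you cannot apply a split-apply-combine strategy to the data as it stands for this task, but you can if you make the data bigger. Let's start with a simpler example: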
dt = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = 'a')
# now let's extend this table the following way
# take the unique a's and construct all the combinations excluding one element
combinations = dt[, combn(unique(a), 2)]
# now combine this into a data.table with the excluded element as the index
# and merge it back into the original data.table
extension = rbindlist(apply(combinations, 2,
function(x) data.table(a = x, index = setdiff(c(1,2,3), x))))
setkey(extension, a)
dt.extended = extension[dt, allow.cartesian = TRUE]
dt.extended[order(index)]
# a index b
# 1: 2 1 3
# 2: 2 1 4
# 3: 3 1 5
# 4: 3 1 6
# 5: 1 2 1
# 6: 1 2 2
# 7: 3 2 5
# 8: 3 2 6
# 9: 1 3 1
#10: 1 3 2
#11: 2 3 3
#12: 2 3 4
# Now we have everything we need:
dt.extended[, mean(b), by = list(a = index)]
# a V1
#1: 3 2.5
#2: 2 3.5
#3: 1 4.5
Back to the original data (doing a few things slightly differently to simplify the expressions):
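As a sanity check on the toy example, the same numbers can be reproduced by brute force: for each value of a, average b over every row whose a is different. This is only a minimal sketch for verification; the names vals and brute are made up here, and this is the slow per-group approach the extended table is meant to avoid.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2, 2, 3, 3), b = 1:6, key = "a")

# For each distinct a, take the mean of b over all rows with a different a.
vals  <- unique(dt$a)
brute <- data.table(
  a  = vals,
  V1 = sapply(vals, function(v) mean(dt$b[dt$a != v]))
)
print(brute)
```

The V1 values should agree with the result of the grouped mean on dt.extended (4.5, 3.5 and 2.5 for a = 1, 2 and 3), just in a different row order.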
extension = d[, {Carrier.uniq = unique(Carrier);
.SD[, rbindlist(combn(Carrier.uniq, length(Carrier.uniq)-1,
function(x) data.table(Carrier = x,
index = setdiff(Carrier.uniq, x)),
simplify = FALSE))]}, by = Market]
setkey(extension, Market, Carrier)
extension[d, allow.cartesian = TRUE][, mean(Stops), by = list(Market, Carrier = index)]
# Market Carrier V1
#1: IAH:SNA Southwest 1.000000
#2: IAH:SNA Delta 3.000000
#3: MSP:CLE JetBlue 2.000000
#4: MSP:CLE Southwest 1.500000
#5: MSP:CLE American 1.666667
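For the specific case of a mean, the excluded-group value can also be obtained arithmetically, without extending the table: subtract the carrier's own sum and count from the market totals. The following is a sketch of the kind of shortcut that works for "some functions" such as the mean; it is not necessarily what the other answer does, and the helper column names (m.sum, m.n, c.sum, c.n) are made up for illustration.

```r
library(data.table)

# Same example data as in the question, built directly.
d <- data.table(
  Market  = c(rep("IAH:SNA", 4), rep("MSP:CLE", 4)),
  Carrier = c("Delta", "Delta", "Southwest", "Southwest",
              "Southwest", "Southwest", "American", "JetBlue"),
  Stops   = c(1, 1, 3, 3, 2, 2, 2, 1)
)

# Market-level sum and row count, repeated on every row of the market.
d[, `:=`(m.sum = sum(Stops), m.n = .N), by = Market]
# Sum and row count for each carrier within its market.
d[, `:=`(c.sum = sum(Stops), c.n = .N), by = list(Market, Carrier)]
# Mean over the other carriers = (market sum - own sum) / (market count - own count).
d[, s := (m.sum - c.sum) / (m.n - c.n)]
# Drop the helper columns.
d[, c("m.sum", "m.n", "c.sum", "c.n") := NULL]
print(d)
```

This needs only two grouped passes, so it should scale much better than re-subsetting per group, but it relies on the mean being decomposable into sums and counts. A function like the median has no such decomposition and needs the extended-table approach. Note also that a market served by a single carrier yields 0/0, i.e. NaN.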