tos*_*pig 10 r data.table
I would like to use data.table to improve the speed of a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup):
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
thisTime=sample(t, n, replace=TRUE),
thisLocation=sample(la,n,replace=TRUE),
finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = rep(paste0("l-", seq_len(1000)), length.out = 10000),
lkpTime=sample(t, 10000, replace=TRUE),
lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled (1,000 distinct ids across 10,000 rows)
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime).
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
## filter lookup based on thisLocation and finalLocation,
## and only return values where the lkpId has both 'this' and 'final' locations
tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
availServices <- tempThis[tempThis %in% tempFinal]
tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
temp2 <- thisTime - tempThisFinal$lkpTime
## take the lkpId with the minimum positive difference
## (note: temp2 > 0 is strict, so a difference of exactly zero is excluded)
selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
selectedId
}
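To make the selection rule concrete, here is a minimal sketch of the same logic on a tiny, hypothetical lookup table (toy_lookup, thisLoc, finalLoc and toy_selected are illustrative names, not part of the data above). It assumes the strict "> 0" filter used in getId.

```r
library(data.table)

## Hypothetical toy lookup table (not the question's data)
toy_lookup <- data.table(
  lkpId       = c("l-1", "l-1", "l-2", "l-2", "l-3"),
  lkpTime     = c(10, 20, 12, 25, 14),
  lkpLocation = c("a", "n", "a", "n", "a")
)

thisTime <- 15
thisLoc  <- "a"
finalLoc <- "n"

## ids that appear at both the starting and the final location
both <- intersect(toy_lookup[lkpLocation == thisLoc, lkpId],
                  toy_lookup[lkpLocation == finalLoc, lkpId])

## candidate rows at the starting location, with their time differences
cand <- toy_lookup[lkpId %in% both & lkpLocation == thisLoc]
cand[, timeDiff := thisTime - lkpTime]

## the smallest strictly positive difference wins
toy_selected <- cand[timeDiff > 0][which.min(timeDiff), lkpId]
toy_selected
```

Here "l-3" has the closest lkpTime at location "a", but it never appears at "n", so it is dropped before the time comparison and "l-2" is selected.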
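Attempted solution
I need to get the lkpId for each row of dt. My initial instinct was therefore to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster: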
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gain over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
thisLocation <- as.character(x[["thisLocation"]])
finalLocation <- as.character(x[["finalLocation"]])
thisTime <- as.numeric(x[["thisTime"]])
myId <- getId(thisTime, thisLocation, finalLocation)
})
With n = 10,000, both take about 30 seconds.
Question
Is there a better way of using data.table to apply the getId function over each row of dt?
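Update 12 August 2015
Thanks to the pointer from @eddi, I've redesigned my whole algorithm and am now making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.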
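For readers unfamiliar with rolling joins, the following is a minimal sketch on hypothetical toy data (events, lookup and rolled are made-up names, not the tables from the question). It matches loc exactly and then, for each event, rolls forward the most recent lookup time that is less than or equal to the event time. Note that roll = TRUE also accepts an exact time match, which differs slightly from a strict "> 0" filter.

```r
library(data.table)

## Hypothetical toy data (not the question's dt / dt_lookup)
events <- data.table(loc  = c("a", "a", "b"),
                     time = c(15, 5, 30),
                     id   = 1:3)
lookup <- data.table(loc   = c("a", "a", "b", "b"),
                     time  = c(10, 14, 20, 40),
                     lkpId = c("l-1", "l-2", "l-3", "l-4"))

## For each row of events: exact match on loc, then take the last
## lookup row whose time is <= the event time (rolled forward)
rolled <- lookup[events, on = .(loc, time), roll = TRUE]
rolled
```

The event at time 5 has no earlier lookup row at location "a", so its lkpId comes back as NA; the other two events pick up "l-2" and "l-3".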
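Since asking this question I've spent time looking into what data.table has to offer, and by following @eddi's pointer and researching data.table joins (for example, rolling joins on data.table and inner joins with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row' and redesigning the solution to use joins.
No doubt there are better ways of programming this, but here's my attempt.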
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
idx = thisTime - i.lkpTime > 0
.(id = id[idx],
lkpId = i.lkpId,
thisTime = thisTime[idx],
lkpTime = i.lkpTime)
},
by=.EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lookupLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code
# library(dplyr)
# dt_join <- dt_join %>%
# group_by(id) %>%
# arrange(timeDiff) %>%
# slice(1) %>%
# ungroup
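The final step above keeps one row per id, the one with the smallest timeDiff. As a sketch of that idea in isolation, the example below uses a hypothetical toy table (toy_join and best are illustrative names) and a closely related idiom, .I[which.min(timeDiff)] per group, rather than the order()-based form in the answer; it is one possible way to write the step, not the only one.

```r
library(data.table)

## Hypothetical toy table standing in for dt_join
toy_join <- data.table(id       = c(1L, 1L, 2L, 2L, 2L),
                       lkpId    = c("l-1", "l-2", "l-3", "l-4", "l-5"),
                       timeDiff = c(5, 3, 7, 2, 9))

## .I holds row numbers of toy_join; within each id, pick the row
## number of the smallest timeDiff, then subset by those row numbers
keep <- toy_join[, .I[which.min(timeDiff)], by = id]$V1
best <- toy_join[keep]
best
```

This keeps row 2 for id 1 and row 4 for id 2. If several rows tie on timeDiff, which.min returns the first of them.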