Eco*_*tis 2 python mysql merge r dplyr
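Short version: I have a merge operation that is more complex than usual, and I would like help optimizing it with dplyr or merge. I already have several solutions, but they run very slowly on large datasets, and I am curious whether there is a faster approach in R (or in SQL or Python).
I have two data.frames: an asynchronous log (df) and a table of store-level details (Stores).
The problem: a store ID is a unique identifier for a specific location, but a store location may change ownership from one period to the next (and, for completeness, no two owners can hold the same store at the same time). So when I merge in store-level information, I need some kind of condition that merges the store-level information for the correct period.
Reproducible example: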
# asynchronous log.
# t for period.
# Store for store loc ID
# var1 just some variable.
set.seed(1)
df <- data.frame(
t = c(1,1,1,2,2,2,3,3,4,4,4),
Store = c(1,2,3,1,2,3,1,3,1,2,3),
var1 = runif(11,0,1)
)
# Store table
# As you can see, lots of store locations opening and closing.
# StartDate is when this business came into existence
# Store is the store id from df
# CloseDate is when this store went out of business
# storeVar1 is just some important var to merge over
Stores <- data.frame(
StartDate = c(0,0,0,4,4),
Store = c(1,2,3,2,3),
CloseDate = c(9,2,3,9,9),
storeVar1 = c("a","b","c","d","e")
)
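Now, I want to merge the information from Stores into the log df only if the Store was open for business in that period (t). CloseDate and StartDate indicate the last and first periods of that business's operation, respectively. (For completeness, though less important: a StartDate of 0 means the store existed before the sample began, and a CloseDate of 9 means the store had not gone out of business at that location by the end of the sample.)
One solution relies on split() by the levels of period t and dplyr::rbind_all(), like so: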
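The rule described above can be written down directly in base R as a join on Store followed by a filter on the period. This is only a minimal sketch to pin down the intended result on the toy data, not a fast approach: it materializes every log row against every owner of that store before filtering. The names `candidates` and `matched` are illustrative and not part of the original question.

```r
# Same toy data as in the reproducible example
set.seed(1)
df <- data.frame(
  t     = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1  = runif(11, 0, 1)
)
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store     = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)

# Step 1: pair each log row with every owner record for its store
candidates <- merge(df, Stores, by = "Store")

# Step 2: keep only the owner that was open during the log row's period t
is_open <- candidates$StartDate <= candidates$t &
  candidates$CloseDate >= candidates$t
matched <- candidates[is_open, ]

# Sort by period and store so the result lines up with the original log
matched <- matched[order(matched$t, matched$Store), ]
matched
```

Note that this behaves like an inner join: a log row whose store was not open in period t would be dropped rather than kept with missing store values.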
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store") {
  library("dplyr")
  # One chunk of the log per period t
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    # Join each chunk only against the stores open during that period
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores,
                    StartDate <= as.numeric(Period) &
                      CloseDate >= as.numeric(Period)),
      by = by
    )
  }
  # rbind_all() in the original; it was deprecated and later removed from dplyr in favor of bind_rows()
  df <- dplyr::bind_rows(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")
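Functionally, this seems to work (I haven't run into a significant error yet, anyway). However, we are dealing with (increasingly common) billions of rows of log data.
If you would like to use it for benchmarking, I made a larger reproducible example on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
Two questions:
In R, you could take a look at the data.table::foverlaps function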
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
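As a possible alternative to foverlaps, the same "store open during period t" condition can be stated as a non-equi join in data.table. This is a sketch rather than part of the answer above: it assumes a data.table version that supports non-equi joins in `on=` (1.9.8 or later), and the names `log_dt`, `stores_dt` and `res` are illustrative.

```r
library(data.table)

# Same toy data as in the question, built as data.tables
set.seed(1)
log_dt <- data.table(
  t     = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1  = runif(11, 0, 1)
)
stores_dt <- data.table(
  StartDate = c(0,0,0,4,4),
  Store     = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)

# For each log row, find the store record with
# Store == Store, StartDate <= t and CloseDate >= t.
# Log rows without an open store are kept with NA in storeVar1.
res <- stores_dt[
  log_dt,
  on = .(Store, StartDate <= t, CloseDate >= t),
  .(t = i.t, Store = i.Store, var1 = i.var1, storeVar1)
]
res
```

Unlike the foverlaps call, this does not need helper StartDate and CloseDate columns on the log, and the result keeps the row order of the log. Which of the two is faster on billions of rows would have to be benchmarked on the real data.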