I have a sample data frame:
library(dplyr)  # provides %>%, group_by(), mutate(), slice(), summarize()

df <- data.frame(cust = sample(1:100, 1000, TRUE),
                 channel = sample(c("WEB", "POS"), 1000, TRUE))
I am trying to mutate it with the following function:
get_channels <- function(data) {
  d <- data
  if (unique(d) %>% length() == 2) {
    d <- "Both"
  } else {
    if (unique(d) %>% length() < 2 && unique(d) == "WEB") {
      d <- "Web"
    } else {
      d <- "POS"
    }
  }
  return(d)
}
This works without any problems, and on a small data frame it takes almost no time.
start.time <- Sys.time()
df %>%
  group_by(cust) %>%
  mutate(chan = get_channels(channel)) %>%
  group_by(cust) %>%
  slice(1) %>%
  group_by(chan) %>%
  summarize(count = n()) %>%
  mutate(perc = count/sum(count))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.34602 secs
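However, when the data frame gets considerably larger, for example more than 1,000,000 cust values, my basic if/else function takes much longer.
How can I simplify this function so that it runs faster?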
You should use data.table for this.
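library(data.table)
setDT(df)  # convert df to a data.table by reference
t1 = Sys.time()
# One row per customer: "both" if two distinct channels occur, otherwise the single channel seen
df = df[ , .(channels = ifelse(uniqueN(channel) == 2, "both", as.character(channel[1]))), by = .(cust)]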
> Sys.time() - t1
Time difference of 0.00500083 secs
> head(df)
   cust channels
1:   37     both
2:   45     both
3:   74     both
4:   20     both
5:    1     both
6:   68     both
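Note that the snippet above labels customers "both", "WEB" or "POS", while get_channels() in the question returns "Both", "Web" or "POS" and then goes on to compute counts and shares. A minimal sketch of how the whole pipeline might look in data.table follows. The helper name label_channels and the fixed seed are my own assumptions for illustration, and I have not benchmarked it on the million-customer case.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dt <- data.table(cust = sample(1:100, 1000, TRUE),
                 channel = sample(c("WEB", "POS"), 1000, TRUE))

# Hypothetical helper: returns the same three labels as get_channels().
label_channels <- function(x) {
  if (uniqueN(x) == 2L) {
    "Both"
  } else if (x[1L] == "WEB") {
    "Web"
  } else {
    "POS"
  }
}

# One row per customer, computed once per group.
per_cust <- dt[, .(chan = label_channels(channel)), by = cust]

# Count customers per label, then add each label's share of all customers.
# The trailing [] makes sure the result prints after := is used.
res <- per_cust[, .(count = .N), by = chan][, perc := count / sum(count)][]
print(res)
```

The label is computed once per customer and yields one row per customer directly, so the mutate() followed by slice(1) step from the question is not needed.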
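If you would rather stay in dplyr, the same idea (collapse each customer to one row with summarise() instead of mutate() plus slice(1)) can be sketched as below. This is an illustration under my own assumptions (fixed seed, the case_when() labelling), not something measured against the data.table version, so treat any speed-up as something to verify on your own data.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(cust = sample(1:100, 1000, TRUE),
                 channel = sample(c("WEB", "POS"), 1000, TRUE))

res <- df %>%
  group_by(cust) %>%
  # One label per customer; each condition is a single TRUE/FALSE per group.
  summarise(chan = case_when(
    n_distinct(channel) == 2 ~ "Both",
    first(channel) == "WEB"  ~ "Web",
    TRUE                     ~ "POS"
  )) %>%
  # Number of customers per label, then each label's share.
  count(chan, name = "count") %>%
  mutate(perc = count / sum(count))

print(res)
```

Because summarise() already returns one row per customer, the second group_by(cust) and slice(1) from the original pipeline drop out.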