Osc*_*ith 3 performance r plyr data.table
我正在尝试使用ddply清理数据,但它在1.3M行上运行速度非常慢.
示例代码:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
Run Code Online (Sandbox Code Playgroud)
我编写了一个函数来删除行,其中一行中的消耗值相同但对另一行中的消耗值为负(对于相同的id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3){
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3}
Run Code Online (Sandbox Code Playgroud)
然后我使用ddply应用该函数来删除数据中的这些错误行
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
function(x){clean_negatives(x)})
proc.time() - ptm
Run Code Online (Sandbox Code Playgroud)
我希望我可以使用data.table来加快速度,但我无法理解如何使用data.table来帮助.
有1.3M的行,到目前为止,我的桌面整天都在计算,但还没有完成.
您的问题询问data.table实施情况.所以,我在这里展示了它.您的功能也可以大大简化.你可以先获得sign通过总结Neg,然后过滤表,然后乘Consumption用sign(如下图所示).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
Run Code Online (Sandbox Code Playgroud)
标杆
百万行的数据:
#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
Run Code Online (Sandbox Code Playgroud)该data.table函数:
FUN.DT <- function() {
require(data.table)
dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
dt <- dt[, sign := sign(sum(Neg)),
by = c("id", "StartDate", "EndDate", "Consumption")]
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
by=c("id", "StartDate", "EndDate", "Consumption")]
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}
Run Code Online (Sandbox Code Playgroud)你的功能 ddply
FUN.PLYR <- function() {
require(plyr)
clean_negatives <- function(x3) {
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3
}
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
function(x) clean_negatives(x))
}
Run Code Online (Sandbox Code Playgroud)基准测试rbenchmark(仅限1次运行)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
1 FUN.DT() 1 6.137 1.000 5.926 0.211 0 0
2 FUN.PLYR() 1 242.268 39.477 152.855 82.881 0 0
Run Code Online (Sandbox Code Playgroud)我的data.table实现速度比您当前的plyr实现速度快39倍(我将其与您的实现进行比较,因为函数不同).
Note:我在函数中加载了包以获得获得结果的完整时间.此外,出于同样的原因我转换data.frame到data.table与基准函数内部键.因此,这是最低速度.