In a large dataset (~1M cases), each case has a "created" and a "censored" dateTime. I want to count, for each case, the number of other cases that were open at the moment it was created. A case is open between its "created" and "censored" dateTimes.
Several solutions work well on small datasets (<100,000 cases), but computation time grows rapidly; my estimate is that it scales as roughly 3n^2. At n = 100,000 the computation takes more than 20 minutes on my server with six 4 GHz cores and 64 GB of RAM. Even with multicore libraries I would at best cut the time by a factor of 8 or 10, which is not enough to handle ~1M cases.
I'm looking for a more efficient way to do this calculation. Below I provide a function that makes it easy to create large numbers of "created" and "censored" dateTime pairs, together with the solutions tried so far, using the dplyr and data.table libraries. Timings are reported to the user for simplicity. You can simply change the "CASE_COUNT" variable at the top, re-run, and see the timings again, making it easy to compare the timings of any other solutions you might suggest.
I'll update the original post with any additional solutions so that their authors get proper credit. Thanks in advance for your help!
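To make the counting rule concrete before the full benchmark script, here is a minimal sketch with three made-up cases (illustrative only, not part of the benchmark below):
# Toy illustration of the counting rule (made-up dateTimes): a case is
# "open" at another case's creation if it was created before and
# censored after that creation time.
library(data.table)
toy <- data.table(id       = 1:3,
                  created  = as.POSIXct(c("2001-01-01", "2001-06-01", "2002-01-01"), tz="UTC"),
                  censored = as.POSIXct(c("2003-01-01", "2001-09-01", "2004-01-01"), tz="UTC"))
# Brute-force count; expected result: 0, 1, 1 (case 1 is open when cases 2
# and 3 are created; case 2 is already censored before case 3 is created)
toy[, open_cases_at_creation := vapply(seq_len(.N), function(i)
        sum(toy$created < created[i] & toy$censored > created[i]), integer(1))]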
# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);
# Set seed for reproducibility
set.seed(123)
# Set number of cases & date range variables
CASE_COUNT <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
                          format="%Y-%m-%d %H:%M:%S",
                          tz="UTC", origin="1970-01-01");
RANGE_END <- as.POSIXct("2012-01-01 00:00:00",
                        format="%Y-%m-%d %H:%M:%S",
                        tz="UTC", origin="1970-01-01");
# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE; # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE; # data.table::foverlaps()
RUN_SOLUTION_3 <- TRUE; # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE; # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE; # data.table::frank()
# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
generate_cases_table <- function(n = CASE_COUNT, start_val = RANGE_START, end_val = RANGE_END) {
    # Measure duration between start_val & end_val
    duration <- as.numeric(difftime(end_val, start_val, units="secs"));
    # Select random values in duration to create start_offset
    start_offset <- runif(n, 0, duration);
    # Calculate the creation time list
    created_list <- start_offset + start_val;
    # Calculate acceptable time range for censored values
    # since they must always be after their respective creation value
    censored_range <- as.numeric(difftime(end_val, created_list, units="secs"));
    # Select random values in each censored range to create end offsets
    creation_to_censored_times <- runif(n, 0, censored_range);
    censored_list <- created_list + creation_to_censored_times;
    # Create and return a data.table with creation & censor values
    # calculated from start or end with random offsets
    return_table <- data.table(id       = 1:n,
                               created  = created_list,
                               censored = censored_list);
    return(return_table);
}
# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();
solution_1_function <- function (cases_table) {
    # SOLUTION 1: Using dplyr::summarize:
    # Group by id to set parameters for summarize() function
    cases_table_grouped <- group_by(cases_table, id);
    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() within summarize()
    cases_table_summary <- summarize(cases_table_grouped,
                                     open_cases_at_creation = sum((cases_table$created < created &
                                                                   cases_table$censored > created)));
    solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function
solution_2_function <- function (cases_table) {
    # SOLUTION 2: Using data.table::foverlaps:
    # Adapted from solution provided by @Davidarenburg
    # (https://stackoverflow.com/users/3001626/david-arenburg)
    # The foverlaps() solution tends to crash R with large case counts.
    # I suspect it has to do with memory assignment of the very large objects:
    # it maxes RAM on my system (64GB) before crashing, possibly attempting
    # to write beyond its assigned memory limits.
    # I'll submit a reproducible bug to the data.table team, since
    # foverlaps() is pretty new and known to be occasionally unstable
    if (CASE_COUNT > 50000) {
        stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
    }
    setDT(cases_table)[, created_dupe := created];
    setkey(cases_table, created, censored);
    foverlaps_table <- foverlaps(cases_table[, c("id","created","created_dupe"), with=FALSE],
                                 cases_table[, c("id","created","censored"), with=FALSE],
                                 by.x=c("created","created_dupe"))[order(i.id), .N-1, by=i.id];
    foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
    solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function
solution_3_function <- function (cases_table) {
    # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
    # Idea suggested by @jangorecki
    # (https://stackoverflow.com/users/2490497/jangorecki)
    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() with data.table aggregation
    cases_table_aggregated <- cases_table[order(id), sum((cases_table$created < created &
                                                          cases_table$censored > created)), by=id];
    solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated,
                                                     open_cases_at_creation=V1), key="id");
} # End solution_3_function
solution_4_function <- function (cases_table) {
    # SOLUTION 4: Using the IRanges package
    # Adapted from solution suggested by @alexis_laz
    # (https://stackoverflow.com/users/2414948/alexis-laz)
    # The IRanges package generates ranges efficiently; it is intended for
    # genome sequencing but works perfectly well on this data, since POSIXct
    # values are representable as numbers
    solution_4_table <<- data.table(id = cases_table$id,
                                    open_cases_at_creation =
                                        countOverlaps(IRanges(cases_table$created,
                                                              cases_table$created),
                                                      IRanges(cases_table$created,
                                                              cases_table$censored)) - 1,
                                    key="id");
} # End solution_4_function
solution_5_function <- function (cases_table) {
    # SOLUTION 5: Using data.table::frank()
    # Adapted from solution suggested by @danas.zuokas
    # (https://stackoverflow.com/users/1249481/danas-zuokas)
    n <- CASE_COUNT;
    # For every case, compute the number of other cases
    # with `created` less than its `created`
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
                            ties.method = 'first')[1:n];
    # For every case, compute the number of other cases
    # with `censored` less than its `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
                            ties.method = 'first')[1:n];
    solution_5_table <<- data.table(id = cases_table$id,
                                    open_cases_at_creation = r1 - r2,
                                    key="id");
} # End solution_5_function
# Execute user-specified functions
if (RUN_SOLUTION_1)
    solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
    solution_2_timing <- try(system.time(solution_2_function(cases_table)));
    # Drop the helper column, but only if solution 2 got far enough to create it
    if ("created_dupe" %in% names(cases_table))
        cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
    solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
    solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
    solution_5_timing <- system.time(solution_5_function(cases_table));
# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && !inherits(solution_2_timing, "try-error")) {
    same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
    same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
    same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
    same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && !inherits(solution_2_timing, "try-error")) {
    same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && !inherits(solution_2_timing, "try-error")) {
    same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && !inherits(solution_2_timing, "try-error")) {
    same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
    same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
    same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
    same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}
same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                  same_check2_3, same_check2_4, same_check2_5, same_check3_4,
                  same_check3_5, same_check4_5);
# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
    "It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
    cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && !inherits(solution_2_timing, "try-error"))
    cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
    cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
    cat("The IRanges solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
    cat("The data.table::frank() solution took", solution_5_timing[3], "seconds.\n\n");
The data.table::foverlaps() solution is faster for smaller case counts (below roughly 5,000; the threshold depends on the randomness as well as n, since foverlaps() uses a binary search to optimize). The dplyr::summarize() solution is faster for larger case counts (above roughly 5,000). Beyond 100,000 cases, neither solution is viable because both are too slow.
EDIT: Added a third solution, based on @jangorecki's suggestion to use data.table aggregation instead of dplyr::summarize(); it is otherwise similar to the dplyr solution. Up to about 50,000 cases it is the fastest solution. Above 50,000 cases the dplyr::summarize() solution is slightly faster, but not by much. Sadly, for 1M cases it is still impractical.
EDIT2: Added a fourth solution, adapted from a solution suggested by @alexis_laz that uses the IRanges package and its countOverlaps function. It is significantly faster than the other three solutions: nearly 400% faster than solutions 1 and 3 at 50,000 cases.
EDIT3: Modified the case generation function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.
EDIT4: Rewrote to let the user select which solutions to execute, and to use system.time() with garbage collection before each execution for more accurate timings (per @jangorecki's astute observation). Also added some condition checks for crash cases.
EDIT5: Added a fifth solution, adapted from a solution suggested by @danas.zuokas that uses rank(). My experimentation suggests it is always at least an order of magnitude slower than the other solutions: at 10,000 cases it takes 44 seconds, versus 3.5 seconds for dplyr::summarize and 0.36 seconds for IRanges.
FINAL EDIT: I made some modifications to solution 5 as suggested by @danas.zuokas, matched with @Khashaa's observation about types. I set the types with as.numeric in the dateTime generation function, which dramatically speeds up rank, since it operates on integers or doubles instead of dateTime objects (it also speeds up the other functions, though less dramatically). With some testing, setting ties.method='first' produces results consistent with the intent. data.table::frank is faster than base::rank and IRanges::rank. bit64::rank is the fastest of all, but it seems to handle ties differently from data.table::frank, and I couldn't get it to handle them as needed. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank along the way; the specific reasons why are beyond the scope of this question.
POST-END NOTE: It turns out that data.table::frank handles POSIXct dateTimes efficiently, whereas neither base::rank nor IRanges::rank does. Consequently, even the as.numeric (or as.integer) type setting is not required with data.table::frank, and since the conversion involves no loss of precision, there are fewer ties.method differences. Thank you to everyone who contributed! I learned a lot! Much appreciated! :) Credit will be included in my source code.
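As a quick, illustrative check of that note (a sketch only; cases_table is the table generated by the script above, and timings will vary by machine and CASE_COUNT):
# Sketch: frank() ranks POSIXct values exactly as it ranks their numeric
# equivalents, so no as.numeric() conversion is required.
x <- cases_table$created
all(data.table::frank(x, ties.method = "first") ==
    data.table::frank(as.numeric(x), ties.method = "first"))  # TRUE
# ties.method = 'first' breaks ties by position:
data.table::frank(c(2, 1, 2), ties.method = "first")  # returns 2 1 3
# Rough speed comparison (results vary by machine and CASE_COUNT):
system.time(data.table::frank(x, ties.method = "first"))
system.time(base::rank(x, ties.method = "first"))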
ENDNOTE: This question is a refined and clarified version of my original question, with easier-to-use and more readable example code and a more efficient way to calculate the open cases at each case's creation. I separated it out here so as not to overwhelm the original post with too many edits, and I simplified the creation of the large number of dateTime pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!
The answer has been updated based on the question author's comments.
I propose a solution using ranks. The tables are created as in the follow-up to this question, or using the dateTime-pair generation function from this question; both should work.
n <- cases_table[, .N]
# For every case, compute the number of other cases
# with `created` less than its `created`
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
                        ties.method = 'first')[1:n]
# For every case, compute the number of other cases
# with `censored` less than its `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
                        ties.method = 'first')[1:n]
Taking the difference r1 - r2 gives the result (no -1 is needed thanks to ties.method='first'), eliminating the contribution of the created ranks. In terms of efficiency, it only requires finding the ranks of vectors of the length of the number of rows of cases_table. data.table::frank handles POSIXct dateTime objects as fast as numeric objects (unlike base::rank), so no type conversion is needed.
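To see that the difference recovers the brute-force count, here is a quick sketch that reuses generate_cases_table() from the question above:
# Sketch: verify the rank-difference logic against a brute-force count.
# For case i (with distinct random dateTimes),
#   r1[i] = 1 + 2 * (#cases created before case i), and
#   r2[i] = 1 + (#cases created before i) + (#cases censored before i's creation),
# so r1 - r2 = (#created before) - (#censored before) = open cases,
# because censoring always happens after creation.
small_table <- generate_cases_table(n = 1000)
n  <- small_table[, .N]
r1 <- data.table::frank(c(small_table[, created], small_table[, created]),
                        ties.method = 'first')[1:n]
r2 <- data.table::frank(c(small_table[, created], small_table[, censored]),
                        ties.method = 'first')[1:n]
brute_force <- vapply(seq_len(n), function(i)
    sum(small_table$created < small_table$created[i] &
        small_table$censored > small_table$created[i]), integer(1))
all(r1 - r2 == brute_force)  # TRUE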