In my benchmarking project, base R sorts a dataset much faster than dplyr or data.table. Why is this? Should we all be using base R?

Question

aSt*_*orn 0 performance r dplyr data.table

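I am comparing how different data manipulation packages perform certain operations on datasets of different sizes.

I generated a dummy dataset (the Cartesian product of iris x iris; meaningless in itself, but essentially just a 22500 x 10 dataset).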

library(dplyr)
library(data.table)
library(rbenchmark)

iris_big <- merge(x = iris, y = iris, by = NULL) 

iris_big_dt <- as.data.table(iris_big) #for data.table

benchmark("Base R" = {
            iris_big[base::order("Petal.Width.y")]
          },
          "dplyr" = {
            dplyr::arrange(iris_big,"Petal.Width.y")
          },          

          "data.table" = {
            data.table::setorder(iris_big_dt,"Petal.Width.y")
          },
          replications = 30,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))


Output:

| test       | replications   |elapsed|...|sys.self|
| --------   | -------------- |----   |---|---|
| Base R     | 30             |0.00   |...|0.00|
| data.table | 30             |0.04   |...|0.02|
| dplyr      | 30             |1.55   |...|0.00|


Why is base R so fast? Why is dplyr so slow? Am I doing something wrong? Thanks.

Answer 1

r2e*_*ans 9

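Problem identification

The output was never verified to be correct, nor to be the same/equivalent across implementations: it is neither. The first result is a single (unsorted) column, and the second is simply unsorted.

- `iris_big[..]` (i.e., the `` base::`[` `` primitive) without a comma selects columns, not rows. Add a trailing comma.
- `base::order("Petal.Width.y")`, whether or not it is inside `iris_big[..]`, always returns the single static value `1`, because it is ordering a length-1 character vector (i.e., `c("Petal.Width.y")`, with no regard for the fact that the string might name a column in the enclosing frame). As a result, the call returns the first column with no change to the row order. The fact that the return value has the wrong dimensions should be a strong hint that something is broken. (Credit for the start of this comment goes to @DonaldSeinen.)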

The original call is therefore effectively one of:
```
iris_big[1]     # just the first column
iris_big[1,]    # just the first row
```
This is fixed with:
```
iris_big[base::order(iris_big$Petal.Width.y),]
```
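Both mistakes can be reproduced on the built-in `iris` data. The following is a minimal sketch; it uses `iris` rather than `iris_big` only to keep the example small, and the variable names are illustrative:

```r
# order() on a string literal sorts a length-1 character vector,
# so the result is always the single index 1.
idx_wrong <- order("Petal.Width")

# Without a comma, `[` on a data.frame selects columns.
one_col <- iris[idx_wrong]      # all 150 rows, first column only
one_row <- iris[idx_wrong, ]    # with a comma: first row only

# Ordering by the column's values and indexing rows sorts the whole frame.
sorted <- iris[order(iris$Petal.Width), ]

dim(one_col)                      # 150 1
dim(one_row)                      # 1 5
dim(sorted)                       # 150 5
is.unsorted(sorted$Petal.Width)   # FALSE
```

Comparing `dim()` of the result with `dim()` of the input is a cheap way to catch this kind of mistake before timing anything.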
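Similarly, `dplyr::arrange(iris_big, "Petal.Width.y")` is broken in much the same way: the quoted name is evaluated as a constant string rather than as a column, so the rows are not reordered. A quick check of whether the column is non-decreasing shows: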
```
dplyr::arrange(iris_big, "Petal.Width.y") %>%
  summarize(nondecr = all(diff(Petal.Width.y) >= 0))
#   nondecr
# 1   FALSE
```
This is fixed by unquoting:
```
dplyr::arrange(iris_big, Petal.Width.y) %>%
  summarize(nondecr = all(diff(Petal.Width.y) >= 0))
#   nondecr
# 1    TRUE
```

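The "quoting" problem in the base and dplyr variants is made more confusing by the fact that base R does not use non-standard evaluation (NSE) here, dplyr requires NSE in `arrange`, and `data.table::setorder` seems to work with either quoted or unquoted names (even though `?setorder` says "Do not quote column names").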
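When the column name really does arrive as a string, each package has its own idiom for it. The sketch below is illustrative only: the variable `col` and the use of the built-in `iris` data are assumptions made for the example, not part of the original benchmark.

```r
library(dplyr)
library(data.table)

col <- "Petal.Width"   # hypothetical: column name held in a string

# base R: standard evaluation, so extract the column by name
by_base <- iris[order(iris[[col]]), ]

# dplyr: arrange() uses NSE; the .data pronoun looks the string up as a column
by_dplyr <- arrange(iris, .data[[col]])

# data.table: setorderv() is the variant that takes a character vector
by_dt <- setorderv(as.data.table(iris), col)

# each result should now be non-decreasing in that column
c(base       = !is.unsorted(by_base[[col]]),
  dplyr      = !is.unsorted(by_dplyr[[col]]),
  data.table = !is.unsorted(by_dt[[col]]))
```

Using the string-aware form in each package avoids relying on `setorder` tolerating a quoted name.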

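(The first bullet, about the missing comma, is also muddied by data.table, because `iris_big_dt[1]` returns the first row, not the first column. While I think I understand some of the original motivation for this design choice, I have always considered it a broken shortcut: it may save thousands (?) of otherwise unnecessary commas per year, but at the cost of ambiguity when reading base/data.table code.)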
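The difference in single-index behaviour is easy to see side by side. This is a small sketch on the built-in `iris` data (an assumption for brevity; the same applies to `iris_big` and `iris_big_dt`):

```r
library(data.table)

df <- iris
dt <- as.data.table(iris)

dim(df[1])     # data.frame: a lone index selects columns -> 150 1
dim(dt[1])     # data.table: a lone index selects rows    -> 1 5

# With an explicit comma, both agree on "first row".
dim(df[1, ])   # 1 5
dim(dt[1, ])   # 1 5
```

Writing the comma explicitly keeps the code unambiguous regardless of which class the object has.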

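Correctness/sameness verification

An important check when benchmarking is that the results are (1) all correct and (2) the same. Checking each one individually, we see: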

ret1wrong1 <- iris_big[base::order("Petal.Width.y")]
ret1wrong2 <- iris_big[base::order("Petal.Width.y"),]      # add comma
ret1 <- iris_big[base::order(iris_big$Petal.Width.y),]     # unquote, add comma
ret2wrong <- dplyr::arrange(iris_big, "Petal.Width.y")
ret2 <- dplyr::arrange(iris_big, Petal.Width.y)            # unquote
ret3 <- data.table::setorder(iris_big_dt, "Petal.Width.y")

range(iris_big$Petal.Width.y) # informative
# [1] 0.1 2.5

head(ret1wrong1)          # wrong, single column
#   Sepal.Length.x
# 1            5.1
# 2            4.9
# 3            4.7
# 4            4.6
# 5            5.0
# 6            5.4
ret1wrong2                # wrong, single row
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
head(ret1)                # CORRECT
#      Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1351            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1352            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1353            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1354            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1355            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1356            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret1$Petal.Width.y) >= 0)
# [1] TRUE

head(ret2wrong)           # wrong: first Petal.Width.y is 0.2, not 0.1
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 2            4.9           3.0            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 3            4.7           3.2            1.3           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 4            4.6           3.1            1.5           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 5            5.0           3.6            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 6            5.4           3.9            1.7           0.4    setosa            5.1           3.5            1.4           0.2    setosa
all(diff(ret2wrong$Petal.Width.y) >= 0)
# [1] FALSE
head(ret2)                # CORRECT
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret2$Petal.Width.y) >= 0)
# [1] TRUE

head(ret3)
#    Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
#             <num>         <num>          <num>         <num>    <fctr>          <num>         <num>          <num>         <num>    <fctr>
# 1:            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2:            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3:            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4:            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5:            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6:            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret3$Petal.Width.y) >= 0)
# [1] TRUE

all.equal(ret1, ret2, check.attributes = FALSE)
# [1] TRUE
all.equal(ret1, ret3, check.attributes = FALSE)
# [1] TRUE


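(We need `check.attributes = FALSE` because otherwise `all.equal` complains about differences in row names and class, which do not matter when comparing the data itself.)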
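The checks above can be bundled into a small guard that runs before any timing. This is only a sketch: `is_valid_sort()` is a hypothetical helper name, and it is demonstrated on the built-in `iris` data rather than on `iris_big`.

```r
# Hypothetical helper: a sorted result must keep the input's dimensions
# and be non-decreasing in the sort column.
is_valid_sort <- function(result, original, col) {
  identical(dim(result), dim(original)) && !is.unsorted(result[[col]])
}

wrong_cols <- iris[order("Petal.Width")]        # one column, rows untouched
wrong_rows <- iris[order("Petal.Width"), ]      # a single row
right      <- iris[order(iris$Petal.Width), ]   # sorted by value

is_valid_sort(wrong_cols, iris, "Petal.Width")  # FALSE
is_valid_sort(wrong_rows, iris, "Petal.Width")  # FALSE
is_valid_sort(iris,       iris, "Petal.Width")  # FALSE (unsorted input)
is_valid_sort(right,      iris, "Petal.Width")  # TRUE
```

The dimension test is what catches the two broken base R calls; the `is.unsorted()` test is what catches a call that returns the data unchanged.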

Corrected benchmark

Now that we have established that the results are equal, let's benchmark them:

iris_big_dt1 <- as.data.table(iris_big) #for data.table
iris_big_dt2 <- as.data.table(iris_big) #for data.table

bench::mark(
  "Base R" = {
    iris_big[base::order(iris_big$Petal.Width.y),]
  },
  "dplyr" = {
    dplyr::arrange(iris_big, Petal.Width.y)
  },
  "data.table 1" = {
    data.table::setorder(iris_big_dt1, "Petal.Width.y")
  },
  "data.table 2" = {
    data.table::setorder(copy(iris_big_dt2), "Petal.Width.y")
  },
  min_iterations = 1000,
  check = FALSE)
# # A tibble: 4 x 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time             gc                  
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>           <list>              
# 1 Base R         3.33ms   3.61ms      262.    1.97MB    5.08    981    19      3.74s <NULL> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 2 dplyr          3.75ms   4.32ms      216.    1.63MB    3.74    983    17      4.55s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 3 data.table 1   1.19ms   1.37ms      713.   87.94KB    0.714   999     1       1.4s <NULL> <Rprofmem[,3] [1 x 3]>  <bch:tm [1,000]> <tibble [1,000 x 3]>
# 4 data.table 2   2.66ms   3.26ms      304.    1.84MB    5.56    982    18      3.23s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>

all(diff(iris_big_dt1$Petal.Width.y)>=0)
# [1] TRUE
all(diff(iris_big_dt2$Petal.Width.y)>=0)
# [1] FALSE


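I included two versions of the data.table variant because it could be argued that re-sorting an already sorted table (a consequence of its by-reference/in-place operation) would be faster. Even with the overhead of `copy`ing the data on every iteration, the `data.table 2` variant is still faster than both `Base R` and `dplyr` (median 3.26ms versus 3.61ms and 4.32ms).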
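The by-reference behaviour that motivates the second variant can be checked directly. The following is a minimal sketch on the built-in `iris` data (an assumption for brevity):

```r
library(data.table)

dt <- as.data.table(iris)

# Sorting a copy leaves `dt` itself in its original order.
sorted_copy <- setorder(copy(dt), Petal.Width)
is.unsorted(dt$Petal.Width)            # TRUE: dt was not modified
is.unsorted(sorted_copy$Petal.Width)   # FALSE

# Sorting `dt` directly reorders it in place; no assignment is needed.
setorder(dt, Petal.Width)
is.unsorted(dt$Petal.Width)            # FALSE
```

This is why `iris_big_dt1` ends up sorted after the benchmark while `iris_big_dt2` does not: only the temporary copy is reordered in the second variant.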

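Many thanks to @r2evans for taking the time to write this very instructive answer; it provided exactly the information I needed. (4 upvotes)
If votes were based on detail and thoroughness, r2evans should have at least >400k reputation points. (2 upvotes)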

Archived:	4 years ago
Views:	1133
Last activity:	4 years ago