data.table vs dplyr: can one do something well the other can't or does poorly?

BrodieG 719 r dplyr data.table

Overview

I'm fairly familiar with data.table, and not so much with dplyr. I've read through some of the dplyr vignettes and examples that have come up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. does not carry much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant to my specific question asked from the perspective of someone already familiar with data.table. I also want to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

That was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favoured a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return arbitrary numbers of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential workaround using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins (see the sketch after this list)
  • data.table internally optimises expressions of the form DT[col == value] and DT[col %in% values] for speed via automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark.
  • dplyr offers standard-evaluation versions of its functions (e.g. regroup, summarize_each_) that can simplify programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge)
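
To make the rolling-join bullet concrete, here is a minimal sketch with made-up price/trade tables (the data and names are hypothetical, not taken from the linked answers):

library(data.table)
prices <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 99), key = "time")
trades <- data.table(time = c(2L, 7L))
# roll = TRUE carries the last observation forward (LOCF):
prices[trades, roll = TRUE]  # t = 2 gets the price from t = 1, t = 7 from t = 5
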
Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Arun.. 495

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows us to easily optimise operations for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
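
For readers who want to get a rough local feel for this, here is a minimal sketch of that kind of grouping benchmark (the sizes, seed and use of microbenchmark are my own choices, not taken from the linked benchmarks):

library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(1)
N <- 1e6; K <- 1e5  # many groups: the regime where data.table tends to pull ahead
DT <- data.table(g = sample(K, N, TRUE), v = rnorm(N))
DF <- as.data.frame(DT)

microbenchmark(
  data.table = DT[, .(s = sum(v)), by = g],
  dplyr      = DF %>% group_by(g) %>% summarise(s = sum(v)),
  times = 10
)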

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment there talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. data.table's interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

    # sub-assign by reference, updates 'y' in-place
    DT[x >= 1L, y := NA]
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

    # copies the entire 'y' column
    ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.
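
    As a quick illustration (my own sketch, not from the posts above), data.table's exported address() helper can be used to verify that := updates the column in place rather than reallocating it:

    library(data.table)
    DT <- data.table(x = 1:5, y = 1:5)
    before <- address(DT$y)           # memory address of column 'y'
    DT[x >= 2L, y := NA_integer_]     # sub-assign by reference
    identical(before, address(DT$y))  # TRUE: same vector, updated in place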

    Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

    foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT 
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                  ## also get modified.
    }
    

    By not using shallow(), the old functionality is retained:

    bar <- function(DT) {
        DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
    }
    

    By shallow copying using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

    DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
    #    x y z
    # 1: 1 a 1
    # 2: 1 a 2
    # 3: 1 b 3
    # 4: 1 b 4
    # 5: 2 a 5
    # 6: 2 a 6
    # 7: 2 b 7
    # 8: 2 b 8
    DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
    #    x y mul
    # 1: 1 a   4
    # 2: 2 b   3
    

    and you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

      # data.table way
      DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
      
      # dplyr equivalent
      DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
          right_join(DF2) %>% mutate(z = z * mul)
      
    • 2) do it all in one go (using the by = .EACHI feature):

      DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
      

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (once for aggregation and again for joining).

    • And more importantly, the operation we wanted to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.
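
    As a quick sanity check (my own addition, using the DT1 and DT2 defined above), both approaches return the same numbers:

    a1 <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]
    a2 <- DT1[DT2, .(z = sum(z) * mul), by = .EACHI]
    all.equal(a1$z, a2$z)  # TRUE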

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

    DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

    Check this post for a real usage scenario.
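
    For comparison, a hedged sketch of the closest dplyr idiom (assuming DF1/DF2 are data.frame copies of DT1/DT2 above); the join has to materialise a full copy of DF1 just to attach one column:

    library(dplyr)
    ans <- DF1 %>%
      left_join(select(DF2, x, y, mul), by = c("x", "y"))  # copies all of DF1; 'mul' is NA where unmatched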

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

  1. Basic aggregation/update operations.

    # case (a)
    DT[, sum(y), by = z]                       ## data.table syntax
    DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
    DT[, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
    # case (b)
    DT[x > 2, sum(y), by = z]
    DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
    DT[x > 2, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))
    
    # case (c)
    DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
    DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
    DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
    DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    
    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation/update on multiple columns

    # case (a)
    DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
    DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
    cols = c("x", "y")                               ## columns to update by reference
    DT[, (cols) := lapply(.SD, sum), by = z]
    ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
    # case (b)
    DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
    DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
    # case (c)
    DT[, c(.N, lapply(.SD, sum)), by = z]     
    DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
    
    • In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to a list (which returns a list).

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

    setkey(DT1, x, y)
    
    # 1. normal join
    DT1[DT2]            ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax
    
    # 2. select columns while join    
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
    # 3. aggregate while join
    DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
        inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
    # 4. update while join
    DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
    ??
    
    # 5. rolling join
    DT1[DT2, roll = -Inf]
    ??
    
    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
    ??
    
    • Some might find a separate function for each join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

    • However dplyr joins do just that. Nothing more. Nothing less.

    • data.tables can select columns while joining (2), and in dplyr you will need to select() on both data.frames first before joining, as shown above. Otherwise you would materialise the join with unnecessary columns, only to remove them later, and that is inefficient.

    • data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

    • data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

    • data.table also has mult = argument which selects first, last or all matches (6).

    • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.

  4. do()...

    dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

    DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by = z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
    DT[, quantile(x, 0.25), by = z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by = z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
    DT[, as.list(summary(x)), by = z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    
    • .SD's equivalent is .

    • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

    • In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on them.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

  • Numerous advantages in data.table joins (for speed/memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators <=, <, >, >= along with all the other advantages of data.table joins (see the sketch after this list).

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • setorder() function in data.table that allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
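
To make the non-equi join point concrete, here is a minimal sketch with hypothetical lookup/observation tables (the data and names are my own, not from the linked posts):

library(data.table)
lookup <- data.table(lo = c(0, 10), hi = c(10, 20), grade = c("low", "high"))
obs    <- data.table(val = c(3, 15))
# match each 'val' to the interval [lo, hi) that contains it:
lookup[obs, .(val = i.val, grade = x.grade), on = .(lo <= val, hi > val)]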


Finally:

  • On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.

  • @bluefeet: I don't think you're doing the rest of us any great service by moving that discussion to chat. I was under the impression that Arun was one of the developers, and this might yield useful insights. (7 upvotes)
  • I think anywhere you use update by reference (`:=`), the `dplyr` equivalent should also use `<-`, as in `DF <- DF %>% mutate...` rather than just `DF %>% mutate...` (6 upvotes)
  • I've read many of these posts and they helped me a lot in understanding data.table and using it better. In most cases I prefer data.table over dplyr or pandas or PL/pgSQL. However, I can't stop thinking about how to express things in it. The syntax is *not* easy, clear or verbose. In fact, even after heavy use of data.table, I often struggle to understand code I wrote literally a week ago. A living example of a write-only language: https://en.wikipedia.org/wiki/Write-only_language So, hopefully, one day we will be able to use dplyr on top of data.table. (6 upvotes)
  • Regarding syntax: I believe `dplyr` can be easier for users who are used to `plyr` syntax, but `data.table` can be easier for users who are used to query-language syntax like `SQL`, and the relational algebra behind it, which is all about tabular data transformation. @Arun you should note that **set operators** are very easy to implement by wrapping `data.table` functions, and of course that brings significant speedup. (4 upvotes)

hadley.. 370

This is my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single-table problems can be solved with just five key verbs: filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs maps to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from a post that Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening, because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

There are two other minor factors that I think decrease readability slightly:

  • Since almost every data table operation uses [ you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't just limited to one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames: it is easily generalised to other contexts - interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium-sized data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, you can use a database backend.
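
As a minimal sketch of swapping backends under the same grammar (tbl_dt() shipped with dplyr at the time and has since moved to the dtplyr package; the grouping example itself is mine):

library(dplyr)
library(data.table)
library(ggplot2)  # for the diamonds dataset

diamonds_dt <- tbl_dt(data.table(diamonds))  # data.table backend, same verbs
diamonds_dt %>%
  group_by(cut) %>%
  summarise(AvgPrice = mean(price))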

All that said, dplyr's performance will get better in the long run. We'll definitely implement some of data.table's great ideas, like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and into memory, analogous to fread().

  • More flexible joins, including support for non-equi-joins.

  • More flexible grouping, like bootstrap samples, rollups and more.

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.

  • Why do fastread when there's already `fread()`? Wouldn't the time be better spent improving fread() or working on other (less developed) things? (32 upvotes)
  • Just a side note: I agree with many of your arguments (though I prefer the `data.table` syntax myself), but you can easily use `%>%` to pipe `data.table` operations if you don't like the `[` style. `%>%` is not specific to `dplyr`; it comes from a separate package (which you happen to be a co-author of too), so I'm not sure I understand what you are trying to say in most of your **Syntax** paragraph. (26 upvotes)
  • @DavidArenburg good point. I've rewritten the syntax section to hopefully make it clearer what my main points are, and to highlight that you can use `%>%` with data.table (10 upvotes)
  • `data.table`'s API is founded on massive abuse of the `[]` notation. That is its greatest strength and its greatest weakness. (9 upvotes)
  • Thanks Hadley, this is a useful perspective. Re indenting, I usually do `DT[\n\texpression\n][\n\texpression\n]` (**[gist](https://gist.github.com/brodieG/85271ef3de82509ad036)**) which actually works pretty well. I'm keeping Arun's answer as the accepted one since he more directly answers my specific questions, which are less about accessibility of syntax, but I think this is a good answer for people trying to get a general feel for the differences between `dplyr` and `data.table`. (5 upvotes)
  • @BrodieG that's much better, but I still don't like it. (3 upvotes)
  • @BrodieG putting *"everything"* into the `[` operator means the **data.table query** form has a direct analogy (and deep extension) to **SQL queries** and the relational algebra behind them, which is all about tabular data transformation. Currently I don't see any redundancy in the `[` args (besides `drop`), and yes, there are already a lot of them :) (3 upvotes)
  • @EDi because fastread and fread have different goals. fastread will also provide a C++ API, so you can use the same underlying code to read new file formats (3 upvotes)
  • @DavidArenburg using the pipe operator in the example above, you'll notice there are 4 dots (.), two of which refer to the data.table object itself while the other two don't. That's just confusing to me. (3 upvotes)
  • Thank you for the permission to use data.table for big data! That alone is worth an answer. (2 upvotes)

Thell.. 58

In direct response to the question title...

dplyr definitely does things that data.table can not.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question, but it isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day where data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

This may seem like a punt, but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually the right one for the job at hand. With that being said, sometimes you want to present a particular readability, other times a level of performance, and when you have need of a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

Again, no. data.table excels at being efficient in everything it does, whereas dplyr is limited in some respects by the underlying data store and the registered handlers.

This means that when you run into a performance issue with data.table, you can be pretty sure it is in your query function, and if it actually is a bottleneck in data.table then you've won yourself the joy of filing a report. This is also true when dplyr uses data.table as the back-end; you may see some overhead from dplyr, but odds are it is your query.

When dplyr has performance issues with back-ends, you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to when is plyr better than data.table?

  • Arun's answer explains it well. The most important (in terms of performance) would be fread, update by reference, rolling joins and overlapping joins. I believe there is no package that can compete with those features (not only dplyr). A nice example is the last slide of [this](https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf) presentation. (25 upvotes)
  • You forgot to mention the reverse statement *"data.table definitely does things dplyr can not"*, which is also true. (22 upvotes)
  • All in all, data.table is the reason why I still use R. Otherwise I'd use pandas. It's even better/faster than pandas. (13 upvotes)
  • I like data.table because of its simplicity and its similarity to SQL syntax structure. My job involves very intense ad hoc data analysis and graphics every day for statistical modelling, and I really need a tool simple enough to do complicated things. Now I can reduce my toolkit to just data.table for data and lattice for graphs in my daily job. As an example, I can even do operations like `DT[group == 1, y_hat := predict(fit1, data = .SD), ]`, which is really neat, and I consider it a big advantage over classic SQL in the R environment. (6 upvotes)
  • Can't you wrap a data.table with tbl_dt in dplyr? Why not just get the best of both worlds? (3 upvotes)
  • @JanGorecki that's because I don't know of a _capability_, i.e. something data.table is capable of that dplyr cannot do either directly or via a handler. There are _features_ (like those discussed regarding speed, memory and syntax) that have been discussed as answers to the qualitative (_poorly_) part of the OP's question, but I don't recall seeing _capabilities_ that couldn't be generalised and abstracted up a layer. (2 upvotes)
