Bro*_*ieG 719 r dplyr data.table
我比较熟悉data.table,而不是那么熟悉dplyr.我已经阅读了一些出现在SO上的dplyr小插曲和例子,到目前为止我的结论是:
data.table并且dplyr在速度上具有可比性,除非有许多(即> 10-100K)组,并且在某些其他情况下(参见下面的基准)dplyr 有更多可访问的语法dplyr 摘要(或将)潜在的DB交互在我看来2.没有多大的重量,因为我对它很熟悉data.table,虽然我明白对于那些对这两者都不熟悉的用户来说这将是一个很重要的因素.我想避免争论哪个更直观,因为这与我从已经熟悉的人的角度提出的具体问题无关data.table.我还想避免讨论"更直观"如何导致更快的分析(当然是真的,但同样,不是我最感兴趣的).
我想知道的是:
最近的一个问题让我更多地思考这个问题,因为直到那时我才认为dplyr会提供超出我已经做过的东西data.table.这是dplyr解决方案(Q末尾的数据):
dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
这比我的黑客尝试data.table解决方案要好得多.也就是说,好的data.table解决方案也相当不错(感谢Jean-Robert,Arun,并注意到这里我赞成对最严格的最佳解决方案的单一陈述):
setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]
后者的语法可能看起来非常深奥,但如果你习惯了data.table(即不使用一些更深奥的技巧),它实际上非常简单.
理想情况下,我希望看到的是一些很好的例子,dplyr或者data.table方式更简洁或表现更好.
dplyr不允许返回任意行数的分组操作(来自eddi的问题,请注意:这看起来它将在dplyr 0.5中实现,同样,@ beginneR显示了do在@ eddi问题的答案中使用的潜在解决方法).data.table支持滚动连接(感谢@dholstius)以及重叠连接data.table内部优化形式的表达式DT[col == value]或DT[col %in% values]对于速度通过自动索引,它使用二进制搜索,同时使用相同的基础R语法.请参阅此处了解更多详细信息和一个小基准.dplyr提供的功能(例如,标准的评估版regroup,summarize_each_),可以简化程序中使用的dplyr(请注意程序中使用的data.table是绝对有可能的,只是需要一些认真思考,置换/报价,等等,至少据我所知)data.table变得非常快.data.table扩展比dplyr组数增加更好(更新了包和最近版本的R中的最新增强功能).此外,试图获得在基准唯一值有data.table〜6倍的速度更快.data.table在较大版本的群组/应用/排序中速度提高75%,而较小版本的dplyr速度提高40%(另一个问题来自评论,感谢danas).data.table已对基于分组操作进行了基准测试data.table,dplyr并且pandas在高达20亿行(内存中约为100GB)上进行了python 测试.data.table〜快8倍这是我在问题部分中展示的第一个例子.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))
Aru*_*run 495
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intent is to cover each one of these as clearly as possible from data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.
The data.table syntax is consistent in its form - DT[i, j, by]. To keep i, j and by together is by design. By keeping related operations together, it allows to easily optimise operations for speed and more importantly memory usage, and also provide some powerful features, all while maintaining the consistency in syntax.
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well.
On benchmarks, it would be great to cover these remaining aspects as well:
Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.
Benchmark other operations such as update and joins.
Also benchmark memory footprint for each operation in addition to runtime.
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post. 
Note that Hadley's comment talks about speed (that dplyr is plentiful fast for him), whereas the major concern here is memory.
data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function may not be always desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.
Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:
foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT 
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                              ## also get modified.
}
By not using shallow(), the old functionality is retained:
bar <- function(DT) {
    DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
}
通过创建一个浅拷贝使用shallow(),我们明白,您不想修改原来的对象.我们在内部处理所有事情,以确保在确保复制列时仅在绝对必要时修改.实施时,这应该完全解决参考透明度问题,同时为用户提供两种可能性.
另外,一旦
shallow()导出,dplyr的data.table接口应该避免几乎所有的副本.所以那些喜欢dplyr语法的人可以将它与data.tables一起使用.但它仍然缺少data.table提供的许多功能,包括(sub)-assignment by reference.
加入时聚合:
假设您有两个data.tables如下:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
并且您希望sum(z) * mul在DT2按列加入时获取每一行x,y.我们可以:
1)聚合DT1得到sum(z),2)执行连接和3)乘(或)
# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
    right_join(DF2) %>% mutate(z = z * mul)
2)一次完成(使用by = .EACHI功能):
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
有什么好处?
我们不必为中间结果分配内存.
我们没有两次分组/哈希(一个用于聚合,另一个用于加入).
更重要的是,通过查看j(2),我们想要执行的操作是清楚的.
查看此帖子以获取详细说明by = .EACHI.没有实现中间结果,并且连接+聚合一次性完成.
Have a look at this, this and this posts for real usage scenarios.
In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).
Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without avoiding a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.
This is what we mean when we say the DT[i, j, by] form is consistent.
Similarly in case (c), when we have if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition satisfies and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value. 
While it returns the same result, using filter() here makes the actual operation less obvious.
It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.
Aggregation/update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]     
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
In case (a), the codes are more or less equivalent. data.table uses familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs(). 
data.table's := requires column names to be provided, whereas dplyr generates it automatically.
In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
In case (c) though, dplyr would return n() as many times as many columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list which returns a list.
Note: Once again, in data.table, all we need to do is return a list in
j. Each element of the list will become a column in result. You can usec(),as.list(),lapply(),list()etc... base functions to accomplish this, without having to learn any new functions.You will need to learn just the special variables -
.Nand.SDat least. The equivalent in dplyr aren()and.
Joins
dplyr provides separate functions for each type of join where as data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.
setkey(DT1, x, y)
# 1. normal join
DT1[DT2]            ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join    
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??
# 5. rolling join
DT1[DT2, roll = -Inf]
??
# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each joins much nicer (left, right, inner, anti, semi etc), whereas as others might like data.table's DT[i, j, by], or merge() which is similar to base R. 
However dplyr joins do just that. Nothing more. Nothing less.
data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before to join as shown above. Otherwise you would materialiase the join with unnecessary columns only to remove them later and that is inefficient.
data.tables can aggregate while joining (3) and also update while joining (4), using by = .EACHI feature. Why materialse the entire join result to add/update just a few columns?
data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
data.table also has mult = argument which selects first, last or all matches (6).
data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with
DT[i, j, by]with additional arguments allowing for controlling the output further.
do()...
dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand about all your functions return value. 
DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD's equivalent is .
In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.
In dplyr, cannot do that. Have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow. 
Once again, data.table's syntax is consistent with
DT[i, j, by]. We can just keep throwing expressions injwithout having to worry about these things.
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
I have pointed out most of the features here and also in this post. In addition:
fread - fast file reader has been available for a long time now.
fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
Automatic indexing - another handy feature to optimise base R syntax as is, internally.
Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.
Numerous advantages in data.table joins (for speed/memory efficiency and syntax) mentioned above.
Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.
Overlapping range joins was implemented in data.table recently. Check this post for an overview with benchmarks.
setorder() function in data.table that allows really fast reordering of data.tables by reference.
dplyr provides interface to databases using the same syntax, which data.table does not at the moment.
data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).
data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
Finally:
On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.
On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
OpenMP.had*_*ley 370
这是我从dplyr角度全面回答的尝试,遵循Arun回答的大致轮廓(但根据不同的优先级进行了一些重新安排).
语法有一些主观性,但我坚持认为data.table的简洁使得学习更难学,更难阅读.这部分是因为dplyr解决了一个更容易的问题!
dplyr为您做的一件非常重要的事情是它 限制了您的选择.我声称大多数单表问题都可以通过五个关键动词过滤,选择,变异,排列和总结以及"按组"副词来解决.当您学习数据操作时,这种约束是一个很大的帮助,因为它有助于您对问题的思考.在dplyr中,每个动词都映射到一个函数.每个功能都可以完成一项工作,并且易于理解.
您可以通过与这些管道的简单操作共创复杂
 %>%.以下是Arun 链接到的帖子中的一个示例:
diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))
即使你以前从未见过dplyr(甚至是R!),你仍然可以得到正在发生的事情的要点,因为这些函数都是英语动词.英语动词的缺点是它们需要更多的打字
 [,但我认为通过更好的自动完成可以在很大程度上减轻这种情况.
这是等效的data.table代码:
diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]
除非您已经熟悉data.table,否则更难遵循此代码.(我也无法弄清楚如何以[
一种看起来很好看的方式缩进重复).就个人而言,当我看到我6个月前写的代码时,就像查看陌生人编写的代码一样,所以我更喜欢简单明了的代码.
我认为其他两个次要因素会略微降低可读性:
由于几乎每个数据表操作都使用[您需要额外的上下文来确定发生了什么.例如,是x[y]
从数据框连接两个数据表还是从中提取列?这只是一个小问题,因为在编写良好的代码中,变量名称应该表明发生了什么.
我喜欢这group_by()是dplyr中的一个单独操作.它从根本上改变了计算,所以我认为在略读代码时应该是显而易见的,并且它group_by()比by参数更容易发现[.data.table.
我也喜欢管道 
不仅限于一个包装.您可以先用tidyr整理数据,然后用
 ggvis中的绘图完成.而且你不仅限于我写的软件包 - 任何人都可以编写一个函数来构成数据操作管道的无缝部分.实际上,我更喜欢以前用%>%以下内容重写的data.table代码:
diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]
管道的概念%>%不仅限于数据框架,而且很容易推广到其他环境:交互式网页图形,网页抓取,
 要点,运行时间合同 ......)
我把它们混在一起,因为对我来说,它们并不那么重要.大多数R用户的工作量低于100万行,而dplyr足够快,足以满足您不了解处理时间的数据量.我们优化dplyr以表达对中等数据的表达; 随时可以使用data.table获取更大数据的原始速度.
dplyr的灵活性还意味着您可以使用相同的语法轻松调整性能特征.如果dplyr与数据帧后端的性能不够好,则可以使用data.table后端(尽管功能有限).如果您使用的数据不适合内存,则可以使用数据库后端.
总而言之,dplyr的表现会在长期内变得更好.我们肯定会实现data.table的一些好主意,比如基数排序和联接和过滤器使用相同的索引.我们还致力于并行化,因此我们可以利用多个内核.
我们计划在2015年开展的一些工作:
该readr软件包,可以轻松地将文件从磁盘上移到内存中,类似于fread().
更灵活的连接,包括对非equi-joins的支持.
更灵活的分组,如bootstrap样本,汇总等
我还投入时间来改进R的数据库连接器,与web apis交谈的能力 ,以及更容易 抓取HTML页面.
The*_*ell 58
dplyr 绝对做data.table不了的事情.你的观点#3
dplyr抽象(或将)潜在的DB交互
是你自己的问题的直接答案,但没有提升到足够高的水平.dplyr对于多个数据存储机制来说,它data.table是一个真正的可扩展前端,同时也是单个数据存储机制的扩展.
将其dplyr视为后端不可知接口,所有目标都使用相同的语法,您可以随意扩展目标和处理程序.data.table从这个dplyr角度来看,是其中一个目标.
你永远不会(我希望)看到有一天data.table试图翻译你的查询来创建与磁盘或网络数据存储一起运行的SQL语句.
dplyr可能做的事情data.table不会或可能不会做.基于内存工作的设计,data.table可能会有更难以将自身扩展到并行处理查询dplyr.
对于熟悉软件包的人来说,是否需要使用一个或另一个软件包来编写分析任务更加容易(例如,需要按键的一些组合与所需的深奥水平相结合,其中每个项目的好处都是好事).
这似乎是一个平底船,但真正的答案是否定的.人们熟悉的工具似乎使用的无论是一个最熟悉的他们还是一个实际上是手头的工作是正确的.话虽如此,有时你想要提供一个特定的可读性,有时候是一个性能水平,当你需要足够高的两者时,你可能只需要另一个工具来配合你已经拥有的东西来做出更清晰的抽象.
是否存在在一个包装与另一个包装中更有效地执行分析任务(即,超过2倍)的分析任务.
再一次,没有.  data.table擅长在其所做的一切事情中保持高效,dplyr在某些方面受到限制,因为它们在底层数据存储和注册处理程序方面受到限制.
这意味着当您遇到性能问题与data.table您可以相当肯定它在你的查询功能,如果它是实际上是一个瓶颈问题data.table,那么你已经赢得了自己在提交报告的喜悦.时也是如此dplyr使用data.table作为后端; 你可能会看到一些开销,dplyr但可能是你的查询.
当dplyr后端存在性能问题时,您可以通过注册混合评估函数或(在数据库的情况下)在执行之前操作生成的查询来绕过它们.
另请参阅plyr何时比data.table更好的接受答案?
Iya*_*Lin 12
阅读 Hadley 和 Arun 的回答,你会觉得那些喜欢dplyr's 语法的人在某些情况下会切换到data.table或妥协以延长运行时间。
但正如一些人已经提到的,dplyr可以data.table用作后端。这是使用dtplyr最近发布的 1.0.0版软件包完成的。学习dtplyr几乎不需要额外的努力。
使用时dtplyr使用函数lazy_dt()声明一个惰性数据表,然后使用标准dplyr语法来指定对它的操作。这将类似于以下内容:
new_table <- mtcars2 %>% 
  lazy_dt() %>%
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))
  new_table
#> Source: local data table [?? x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
该new_table对象没有被评估,直到要求它as.data.table()/ as.data.frame()/as_tibble()在该点处底层data.table执行操作。
我重新创建了data.table作者 Matt Dowle 于 2018 年 12 月完成的基准分析,其中涵盖了对大量组进行操作的情况。我发现这dtplyr确实使大多数喜欢dplyr语法的人在享受data.table.
| 归档时间: | 
 | 
| 查看次数: | 113401 次 | 
| 最近记录: |