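I am working with a simulated dataset containing many groups (2+ million), where I want to count, for each group, the total number of observations and the number of observations above a threshold (here, 2).
When I create a flag variable first, the computation seems to be much faster, especially for dplyr, and a little faster for data.table.
Why does this happen? How does it work in the background in each case?
See the examples below.
Simulated dataset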
# create an example dataset
set.seed(318)
N = 3000000 # number of rows
dt = data.frame(id = sample(1:5000000, N, replace = T),
value = runif(N, 0, 10))
Using dplyr
library(dplyr)
# calculate summary variables for each group
t = proc.time()
dt2 = dt %>% group_by(id) %>% summarise(N = n(),
N2 = sum(value > 2))
proc.time() - t
# user system elapsed
# 51.70 0.06 52.11
# calculate summary variables for each …
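The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the column name `flag` and the object names `res_inline` and `res_flag` are hypothetical, and the dataset is scaled down from the question's 3,000,000 rows so the sketch runs quickly. It shows the two variants being compared and checks that they produce the same counts. It does not reproduce the timings.

```r
library(dplyr)

set.seed(318)
N <- 30000  # assumption: much smaller than the question's N, for a quick run
dt <- data.frame(id = sample(1:50000, N, replace = TRUE),
                 value = runif(N, 0, 10))

# Variant 1: the comparison is evaluated inside summarise(), once per group
res_inline <- dt %>%
  group_by(id) %>%
  summarise(N = n(), N2 = sum(value > 2))

# Variant 2: the flag is computed once for the whole column, then summed per group
# ("flag" is a hypothetical column name)
res_flag <- dt %>%
  mutate(flag = as.integer(value > 2)) %>%
  group_by(id) %>%
  summarise(N = n(), N2 = sum(flag))

# Both variants should yield identical per-group counts
stopifnot(isTRUE(all.equal(as.data.frame(res_inline), as.data.frame(res_flag))))
```

Wrapping each variant in `system.time()` on the full-size data would be one way to compare the elapsed times on a given machine.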
I guess it is related to this story: "Changing behaviour of stats::lag when loading the dplyr package", but I found some strange behaviour of the lag function when I try to use the default = option.
Check the simple commands below
library(dplyr)
df = data.frame(mtcars)
df %>% mutate(lag_cyl = lag(cyl))
## it works with NA in first value (as expected)
df %>% mutate(lag_cyl = lag(cyl, default = 999))
## it works with a given value as default
df %>% mutate(lag_cyl = lag(cyl, default = cyl[1]))
## it DOESN'T WORK with the first value of the column as default
df %>% mutate(lag_cyl = dplyr::lag(cyl, default = cyl[1]))
## it works when …
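For context, here is a small sketch of how the two `lag` functions mentioned above differ on a plain numeric vector. The vector `x` and the result names are made up for illustration. It only shows the difference between `stats::lag` and `dplyr::lag`. It does not attempt to diagnose the failing `mutate()` call.

```r
x <- c(6, 6, 4, 6, 8)  # hypothetical values

# stats::lag() leaves the values in place and only shifts the
# time-series attribute "tsp"
stats_res <- stats::lag(x)

# dplyr::lag() shifts the values by one position and fills the
# first slot with `default`
dplyr_res <- dplyr::lag(x, default = x[1])

print(as.numeric(stats_res))
print(dplyr_res)
```

Calling the function with an explicit namespace, as in `dplyr::lag(...)`, removes any doubt about which of the two is being used.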
This is a simplified version of a larger-scale problem. The goal is to use a data.table structure together with dplyr commands to sort and group by multiple columns faster.
The correct version is as follows:
library(dplyr)
library(data.table)
library(dtplyr)
library(lubridate)
# data set
dt = data.frame(id = c("a","b", "a"),
date = ymd(c("2016-01-03","2016-01-02","2016-01-01")),
value = c(10,5,9), stringsAsFactors = F)
# process to get the id of the largest value
(setDT(dt, key=c("id","value")) %>% select(id,value) %>% arrange(desc(value)) %>% slice(1))$id -> picked_id
# return all rows of this id
dt %>% filter(id %in% picked_id)
# id date value
# 1: a 2016-01-01 9
# 2: a 2016-01-03 10
But when I try to use setDT at a different position in my script, I get a different result:
dt = data.frame(id …
I have data like this, df_Filtered:
Product Relative_Value
Car 0.12651458
Plane 0.08888552
Tank 0.03546231
Bike 0.06711630
Train 0.06382191
I want to make a bar chart of the data in ggplot2:
ggplot(df_Filtered, aes(x = Product, y = Relative_Value, fill = Product)) +
scale_y_continuous(labels = scales::percent) +
geom_bar(stat = "identity") +
theme_bw() +
theme(plot.background = element_rect(colour = "black", size = 1)) +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x ="Product", y = "Percentage of total sell", title = "Japan 2010") +
theme(panel.grid.major = element_blank())
How do I get rid of the decimals on the y axis in the chart, so that it says 20 % rather than 20.0 %?
My df looks like this:
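One possible approach, sketched here under the assumption that a reasonably recent version of the scales package is installed, is to pass a labeller built with `scales::label_percent(accuracy = 1)` to `scale_y_continuous()`. The name `whole_percent` is made up for this example, and the plot is trimmed down to the parts that matter for the axis labels.

```r
library(ggplot2)

df_Filtered <- data.frame(
  Product = c("Car", "Plane", "Tank", "Bike", "Train"),
  Relative_Value = c(0.12651458, 0.08888552, 0.03546231, 0.06711630, 0.06382191)
)

# accuracy = 1 rounds the labels to whole percentages
# ("whole_percent" is a hypothetical name)
whole_percent <- scales::label_percent(accuracy = 1)

p <- ggplot(df_Filtered, aes(x = Product, y = Relative_Value, fill = Product)) +
  geom_col() +
  scale_y_continuous(labels = whole_percent) +
  theme_bw() +
  theme(legend.position = "none")

print(whole_percent(c(0.05, 0.1, 0.2)))
```

If a space before the percent sign is wanted, `label_percent()` also accepts a `suffix` argument, for example `suffix = " %"`.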
library(tidyr) # provides gather() and re-exports %>%
df <- data.frame(date = c('2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01'),
"alo" = c(10, 11, 12.5, 9),
"bor" = c(18, 20, 23, 19),
"car" = c(100, 125, 110, 102)) %>%
gather(-date, key = "key", value = "value")
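I want to plot the alo and bor columns as two sets of bars on the same plot, so I gathered the df. However, I want the car column to appear as a line chart rather than as bars on that same plot.
Currently, my plotting code is: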
ggplot(df, aes(date, value, fill = key)) +
geom_bar(stat = 'identity', position = "dodge")
Please suggest how I can add a line chart for the third column instead of bars. Thanks!
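A minimal sketch of one way to do this, assuming it is acceptable to give each layer its own subset of the long-format data: the bar layer gets the `alo` and `bor` rows, and the line layer gets the `car` rows. The names `bars` and `line` are hypothetical, and the data frame is built directly in long form here so the example does not depend on `gather()`.

```r
library(ggplot2)

dates <- c("2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01")
df <- data.frame(
  date  = rep(dates, times = 3),
  key   = rep(c("alo", "bor", "car"), each = 4),
  value = c(10, 11, 12.5, 9,  18, 20, 23, 19,  100, 125, 110, 102)
)

bars <- subset(df, key != "car")  # rows drawn as dodged bars
line <- subset(df, key == "car")  # rows drawn as a line

p <- ggplot(mapping = aes(x = date, y = value)) +
  geom_col(data = bars, aes(fill = key), position = "dodge") +
  # date is a character column, so group = 1 is needed to connect the points
  geom_line(data = line, aes(group = 1)) +
  geom_point(data = line)
```

Because `car` is roughly five to ten times larger than the other two series, the bars will look short on the shared y axis. Whether that is acceptable depends on what the plot is meant to show.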