sto*_*eel 6 r dataframe dplyr data.table
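Suppose I have a data frame that tracks how a value changes from row to row. If the value increases, the change is "positive". If the value decreases, the change is "negative". Otherwise, the change column shows "no change".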
df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'))
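I want to create a new column that tracks the nature of the last change that occurred in the value column. In this case, the final data frame should look like this: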
df = data.frame(sequence = 1:10, value = c(1,1,1, 2, 2,2,2,1,1,1), change = c('no change', 'no change', 'no change', 'positive', 'no change', 'no change', 'no change', 'negative', 'no change','no change'), last_change = c('no change', 'no change', 'no change', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative'))
How can I do this? This is sample data; my actual data contains 3 million rows. Any help would be greatly appreciated.
You can do this with data.table::rleid, grouping by runs of equal values and taking the first change in each run:
library(data.table)
setDT(df)[, last_change:=first(change), rleid(value)]
Output:
    sequence value    change last_change
       <int> <num>    <char>      <char>
 1:        1     1 no change   no change
 2:        2     1 no change   no change
 3:        3     1 no change   no change
 4:        4     2  positive    positive
 5:        5     2 no change    positive
 6:        6     2 no change    positive
 7:        7     2 no change    positive
 8:        8     1  negative    negative
 9:        9     1 no change    negative
10:       10     1 no change    negative
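To see why grouping by rleid(value) works, it may help to look at the run ids on their own. This is a small illustrative sketch using the sample values from the question; the name run_id is introduced here and is not part of the answer's code.

```r
library(data.table)

# Sample values from the question
value <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1)

# rleid() assigns one integer id per run of consecutive equal values,
# so the second run of 1s gets a different id than the first
run_id <- rleid(value)
run_id
# [1] 1 1 1 2 2 2 2 3 3 3
```

Each run starts at the row where the value changed, so the first change within a run is the change that opened it.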
A dplyr one-liner, without grouping (faster, see the benchmark):
library(dplyr)
df |>
mutate(last_change = change[x <- lag(value, default = 0) != value][cumsum(x)])
Output:
   sequence value    change last_change
1         1     1 no change   no change
2         2     1 no change   no change
3         3     1 no change   no change
4         4     2  positive    positive
5         5     2 no change    positive
6         6     2 no change    positive
7         7     2 no change    positive
8         8     1  negative    negative
9         9     1 no change    negative
10       10     1 no change    negative
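One caveat: default = 0 only marks the first row as a run start when the first value is not 0. If the column could start with 0, a possible variant is to flag the first row explicitly. This is a sketch under that assumption; df0, is_start and res are hypothetical names used for illustration, and it assumes value contains no NA.

```r
library(dplyr)

# Hypothetical data whose value column starts at 0
df0 <- data.frame(
  value  = c(0, 0, 1, 1, 0),
  change = c("no change", "no change", "positive", "no change", "negative")
)

res <- df0 |>
  mutate(
    # TRUE on the first row and wherever the value differs from the previous row
    is_start    = row_number() == 1 | value != lag(value),
    # pick the change at each run start, then repeat it along the run
    last_change = change[is_start][cumsum(is_start)]
  )

res$last_change
# [1] "no change" "no change" "positive"  "positive"  "negative"
```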
Since the question implies this will be run on a large dataset, I benchmarked both solutions on a data frame with 1,000,000 rows:
library(dplyr)
library(data.table)
library(microbenchmark)
# 100,000 copies of the 10-row sample = 1,000,000 rows
df2 <- do.call("rbind", replicate(100000, df, simplify = FALSE))
microbenchmark(
  f1 = mutate(df2,
              last_change = change[x <- lag(value, default = 0) != value][cumsum(x)]),
  f2 = setDT(df2)[, last_change:=first(change), rleid(value)]
)
Unit: milliseconds
 expr      min       lq      mean   median         uq       max neval
   f1  27.2198  34.2597  47.15247  39.8145   49.72065  155.3829   100
   f2 767.8063 908.2213 996.29603 964.0844 1019.84905 1595.0273   100
My guess is that the first function (the one proposed here) is much faster because it does not perform any grouping.
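If grouping is indeed the bottleneck, the same indexing idea can be written in data.table without a by clause. The following is only a sketch: it was not part of the benchmark above, its timing is untested, and the helper column run is a name introduced here for illustration.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  sequence = 1:10,
  value    = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1),
  change   = c("no change", "no change", "no change", "positive", "no change",
               "no change", "no change", "negative", "no change", "no change")
)

# One id per run of consecutive equal values
dt[, run := rleid(value)]
# !duplicated(run) is TRUE at the first row of each run; indexing the
# resulting per-run vector by run repeats each run's first change
dt[, last_change := change[!duplicated(run)][run]]
# Drop the helper column
dt[, run := NULL]
```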
Explanation
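x <- lag(value, default = 0) != value gives a logical vector indicating whether each value differs from the previous one (length 10). change[x] then gives the values of change where x is TRUE (length 3). Finally, change[x][cumsum(x)] repeats the values of change[x] following the pattern of cumsum(x) (length 10), which is essentially the index of the most recent change.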
(x <- with(df, lag(value, default = 0) != value))
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
with(df, cumsum(x))
# [1] 1 1 1 2 2 2 2 3 3 3
with(df, change[x])
#[1] "no change" "positive" "negative"
with(df, change[x][cumsum(x)])
#[1] "no change" "no change" "no change" "positive" "positive" "positive" "positive" "negative" "negative" "negative"