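I have a data set grouped by three different variables, plus one measured variable, as shown below. ID1 is a factor, ID2 is an integer, ID3 is numeric, and varX is numeric.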
ID1 ID2 ID3 varX
A 1 0.1 40.0
A 1 0.8 70.5
A 2 0.7 55.0
A 2 0.8 65.0
A 2 1.0 60.0
B 4 0.2 70.0
B 5 0.6 55.7
C 1 0.1 55.0
C 1 0.3 90.0
C 1 0.9 60.0
C 5 0.8 45.5
C 5 0.9 30.0
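I want to update each value of varX to the minimum within its ID1 and ID2 group, but also taking ID3 into account: only rows whose ID3 value is greater than or equal to the current row's ID3 are considered when determining the minimum.
For example: for ID1 = A, ID2 = 2, ID3 = 0.7, varX would be the minimum of 55.0, 65.0 and 60.0, i.e. 55.0. For ID1 = A, ID2 = 2, ID3 = 0.8, varX would be the minimum of 65.0 and 60.0, i.e. 60.0.
The resulting table looks like this: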
ID1 ID2 ID3 varX
A 1 0.1 40.0
A 1 0.8 70.5
A 2 0.7 55.0
A 2 0.8 60.0
A 2 1.0 60.0
B 4 0.2 70.0
B 5 0.6 55.7
C 1 0.1 55.0
C 1 0.3 60.0
C 1 0.9 60.0
C 5 0.8 30.0
C 5 0.9 30.0
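I have 36,000 rows of data in this format, so performance is relatively important.
Here is a more verbose dplyr approach that may be fast enough (1 second to process 1 million rows in your format).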
library(dplyr)
df2 <- df %>%
  tibble::rowid_to_column() %>%       # to use later to put back in original order
  group_by(ID1, ID2) %>%
  arrange(desc(ID3)) %>%              # starting with the largest ID3 within each group and working down...
  mutate(varX2 = cummin(varX)) %>%    # what's the min varX encountered so far?
  ungroup() %>%
  arrange(rowid)                      # put back in original order
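As a quick check, the sketch below applies the same descending cummin idea to the 12 sample rows from the question. Two things here are assumptions of this sketch rather than part of the approach above: the data frame is typed in by hand from the question's table, and an extra grouping step over ID1, ID2 and ID3 is added so that rows sharing the same ID3 within a group all receive the same minimum (the question asks for "greater than or equal to", and a plain cummin gives tied rows a result that depends on their sort order). The sample data has no such ties, so the extra step only matters for data like the random test set.

```r
library(dplyr)

# Sample data typed in from the question's table (assumption: columns as described there)
df <- data.frame(
  ID1  = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
  ID2  = c(1L, 1L, 2L, 2L, 2L, 4L, 5L, 1L, 1L, 1L, 5L, 5L),
  ID3  = c(0.1, 0.8, 0.7, 0.8, 1.0, 0.2, 0.6, 0.1, 0.3, 0.9, 0.8, 0.9),
  varX = c(40.0, 70.5, 55.0, 65.0, 60.0, 70.0, 55.7, 55.0, 90.0, 60.0, 45.5, 30.0)
)

result <- df %>%
  mutate(rowid = row_number()) %>%      # remember the original row order
  arrange(ID1, ID2, desc(ID3)) %>%      # largest ID3 first within each group
  group_by(ID1, ID2) %>%
  mutate(varX2 = cummin(varX)) %>%      # running minimum from the largest ID3 downwards
  group_by(ID1, ID2, ID3) %>%           # replaces the previous grouping
  mutate(varX2 = min(varX2)) %>%        # added step: tied ID3 rows share one minimum
  ungroup() %>%
  arrange(rowid) %>%                    # restore the original row order
  select(-rowid)

print(result)
```

The new column varX2 can be compared row by row with the varX column of the expected table in the question.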
Here is the fake data I tested with:
n <- 1000000
# tibble() replaces the deprecated data_frame(); it is re-exported by dplyr
df <- tibble(
  ID1 = sample(LETTERS[1:26], size = n, replace = TRUE),
  ID2 = sample(1:100, size = n, replace = TRUE),
  ID3 = sample(0.1 * 1:10, size = n, replace = TRUE),
  varX = rnorm(n, 50, 30))