Get pairwise differences between pairs of rows

Question

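I want to create a new variable that holds the squared difference between the "price" values of each pair of adjacent rows in the dataset (rows 1 and 2, rows 3 and 4, and so on).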

test <- data.frame(id = c(6, 16, 26, 36, 46, 56),
                    house = c(1, 5, 10, 23, 25, 27), 
                    price = c(79, 84, 36, 34, 21, 12))

The new variable is diff = (79-84)^2, (36-34)^2, (21-12)^2. The desired output looks like this:

diff.data <- data.frame(price_diff = c(25, 4, 81))

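I have been trying to use bracket indexing to isolate the first and second rows, take their difference and square it, and then repeat this for the third and fourth rows and so on, but I would appreciate any tips on how to approach this.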

Answer 1

小智 13

A few approaches come to mind.

dplyr

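For quick work, a dplyr approach is convenient because it provides the lag() and lead() functions. First compute the value for every row, then keep every other row, then extract the computed column.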

library(dplyr)

test %>%
    mutate(diff = (price - lead(price))^2) %>%
    slice(seq(1, nrow(.), 2)) %>%
    pull(diff)

[1] 25  4 81
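
As a possible variation on the dplyr approach, each pair of rows can be given a shared group id and then summarised, which returns a data frame shaped like the requested diff.data. This is only a sketch: the column name pair_id is a hypothetical helper, not something from the question or the answer.

```r
library(dplyr)

test <- data.frame(id = c(6, 16, 26, 36, 46, 56),
                   house = c(1, 5, 10, 23, 25, 27),
                   price = c(79, 84, 36, 34, 21, 12))

# pair_id (hypothetical helper column): rows 1-2 get 1, rows 3-4 get 2, ...
diff.data <- test %>%
    mutate(pair_id = (row_number() - 1) %/% 2 + 1) %>%
    group_by(pair_id) %>%
    summarise(price_diff = (first(price) - last(price))^2) %>%
    select(price_diff)

diff.data$price_diff
```

Note that with an odd number of rows the last group would contain a single row, so its difference would come out as 0 rather than NA.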

Base R

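A base R approach that relies on the recycling of logical subscripts. It is a single vectorized operation, and it is the fastest of the three approaches in the benchmark below. It assumes an even number of rows.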

(test$price[c(TRUE, FALSE)] - test$price[c(FALSE, TRUE)])^2

[1] 25  4 81
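
To make the recycling explicit, the one-liner can be unpacked into steps. This is a minimal sketch; the names first_of_pair and second_of_pair and the even-length guard are additions for illustration, not part of the original answer.

```r
price <- c(79, 84, 36, 34, 21, 12)

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ... so it keeps
# positions 1, 3, 5; c(FALSE, TRUE) keeps positions 2, 4, 6.
first_of_pair  <- price[c(TRUE, FALSE)]   # 79 36 21
second_of_pair <- price[c(FALSE, TRUE)]   # 84 34 12

# Guard (an assumption added here): with an odd number of values the two
# vectors differ in length and R recycles the shorter one, sometimes with
# only a warning and sometimes silently.
stopifnot(length(price) %% 2 == 0)

price_diff <- (first_of_pair - second_of_pair)^2
diff.data <- data.frame(price_diff = price_diff)
```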

for loop

An inefficient approach that nevertheless works:

inds <- seq(1, nrow(test), 2)
diff <- numeric(length(inds))

for (i in seq_along(inds)) {
    diff[i] <- (test$price[inds[i]] - test$price[inds[i] + 1])^2
}

diff

[1] 25  4 81

Benchmark

library(microbenchmark)
test_big <- data.frame(price = rnorm(100000, mean(test$price)))

res <- microbenchmark(
    dplyr = {
        diff1 <- test_big %>%
            mutate(diff = (price - lead(price))^2) %>%
            slice(seq(1, nrow(.), 2)) %>%
            pull(diff)
    },
    base = {
        diff2 <- (test_big$price[c(TRUE, FALSE)] - test_big$price[c(FALSE, TRUE)])^2
    },
    loop = {
        inds <- seq(1, nrow(test_big), 2)
        diff3 <- numeric(length(inds))
        for (i in seq_along(inds)) {
            diff3[i] <- (test_big$price[inds[i]] - test_big$price[inds[i] + 1])^2
        }
    }
)

all(c(identical(diff1, diff2), identical(diff2, diff3)))
print(res)

[1] TRUE
Unit: microseconds
  expr       min       lq       mean     median         uq       max neval
 dplyr  1864.352  2118.08  2338.2370  2274.1060  2443.4360  3628.295   100
  base   314.306   346.04   372.9717   374.7605   391.6115   495.895   100
  loop 33623.116 34868.37 35641.7231 35273.2225 35975.5525 58852.630   100

ggplot2::autoplot(res)

Answer 2

小智 3

Here is a solution with a for loop. It is not efficient, but it may be good enough.

test <- data.frame(id = c(6, 16, 26, 36, 46, 56),
                   house = c(1, 5, 10, 23, 25, 27), 
                   price = c(79, 84, 36, 34, 21, 12))

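# Start row of each pair: 1, 3, 5
index <- seq(1, nrow(test) - 1, by = 2)
price_diff <- numeric(length(index))
# Loop over pair positions so the result has no NA gaps at even indices
for (k in seq_along(index)) {
  i <- index[k]
  tmp <- test$price[c(i, i + 1)]
  price_diff[k] <- (tmp[2] - tmp[1])^2
}
price_diff  # print the result; return() is only valid inside a function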
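
The same loop could also be wrapped in a small reusable function. The sketch below assumes that a trailing unpaired value should simply be ignored; pairwise_sq_diff is a hypothetical name chosen for illustration.

```r
# pairwise_sq_diff: hypothetical helper, not from the original answer.
# Returns the squared difference of each consecutive, non-overlapping pair.
pairwise_sq_diff <- function(x) {
  n_pairs <- length(x) %/% 2          # a trailing unpaired value is ignored
  out <- numeric(n_pairs)
  for (k in seq_len(n_pairs)) {
    i <- 2 * k - 1                    # index of the first element of pair k
    out[k] <- (x[i + 1] - x[i])^2
  }
  out
}

pairwise_sq_diff(c(79, 84, 36, 34, 21, 12))
```

Using seq_len(n_pairs) rather than 1:n_pairs keeps the loop from running when the input has fewer than two values.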
