Cumulative minimum and maximum by group

Question

I am trying to compute a running minimum (and maximum) for a data frame in R. The data frame looks like this:

+-----+--------------+-----------+------+------+
| Key | DaysToEvent  | PriceEUR  | Pmin | Pmax |
+-----+--------------+-----------+------+------+
| AAA | 120          |        50 |   50 |   50 |
| AAA | 110          |        40 |   40 |   50 |
| AAA | 100          |        60 |   40 |   60 |
| BBB | ...          |           |      |      |
+-----+--------------+-----------+------+------+

So the minimum price column (Pmin) holds the lowest price seen for that key up to that point in time (DaysToEvent), and Pmax holds the highest.
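
To make that definition concrete, here is a small sketch using the three AAA rows from the table. It relies on base R's cummin() and cummax(); the variable names are only for illustration.

```r
# Prices for key AAA, ordered from the largest DaysToEvent to the smallest
price <- c(50, 40, 60)

pmin_running <- cummin(price)  # lowest price seen so far
pmax_running <- cummax(price)  # highest price seen so far

print(pmin_running)  # 50 40 40
print(pmax_running)  # 50 50 60
```

These match the Pmin and Pmax columns shown for AAA above.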

Here is my implementation:

# Start with a sentinel key so the first row always triggers a reset
currentKey <- ""
for (i in seq_len(nrow(data))) {
  currentRecord <- data[i,]

  if(currentRecord$Key != currentKey) {
    # New key detected - reset pmin and pmax
    pmin <- 100000
    pmax <- 0
    currentKey <- currentRecord$Key
  }

  if(currentRecord$PriceEUR < pmin) {
    pmin <- currentRecord$PriceEUR
  }
  if(currentRecord$PriceEUR > pmax) {
    pmax <- currentRecord$PriceEUR
  }

  currentRecord$Pmin <- pmin
  currentRecord$Pmax <- pmax

  # This line seems to be killing my performance
  # but otherwise the data variable is not updated in
  # global space
  data[i,] <- currentRecord
}

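This works, but it is really slow: only a few rows per second. It works because I have sorted the data frame like this: data = data[order(data$Key, -data$DaysToEvent), ]. I sorted it because I expected O(n log n) for the sort plus O(n) for the for loop. So I thought I would fly through this data, but I do not at all; it takes hours.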
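
For reference, a likely reason for the slowness is that every data[i,] <- currentRecord rewrites a row of the data frame, which is expensive in R. One way to keep the same single pass, sketched here under the assumption that the data is already sorted as described, is to loop over plain vectors and attach the results once at the end. The function name running_min_max and the example data frame are hypothetical.

```r
# Hypothetical helper: same single-pass logic, but on plain vectors.
# Assumes `data` is already sorted by Key and by descending DaysToEvent.
running_min_max <- function(data) {
  n <- nrow(data)
  key <- as.character(data$Key)
  price <- data$PriceEUR
  pmin_out <- numeric(n)  # preallocated result vectors
  pmax_out <- numeric(n)

  for (i in seq_len(n)) {
    if (i == 1 || key[i] != key[i - 1]) {
      # New key detected: restart the running values from this row
      cur_min <- price[i]
      cur_max <- price[i]
    } else {
      cur_min <- min(cur_min, price[i])
      cur_max <- max(cur_max, price[i])
    }
    pmin_out[i] <- cur_min
    pmax_out[i] <- cur_max
  }

  # Write both columns back in one step instead of once per row
  data$Pmin <- pmin_out
  data$Pmax <- pmax_out
  data
}

example <- data.frame(
  Key = c("AAA", "AAA", "AAA", "BBB"),
  DaysToEvent = c(120, 110, 100, 100),
  PriceEUR = c(50, 40, 60, 50)
)
result <- running_min_max(example)
print(result)
```

Starting each key from its own first price also avoids the hard-coded 100000 and 0 limits.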

How can I make this faster?

The previous approach came from a colleague; here it is in pseudocode:

for (i in 1:nrow(data)) {
    ...
    currentRecord$Pmin <- data[subset on the key[find the min value of the price 
                      where DaysToEvent > currentRecord$DaysToEvent]]
    ...
}

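That also works, but I think the order of that approach is much higher: O(n^2 log n) if I am right, and it takes days. So I expected my version to be a big improvement.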

So I tried to get my head around the various *apply functions and by(), which of course are what you really want to use.

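If I use by() and split on the key, that gets me close. However, I cannot work out how to get the running min/max. I am trying to think in the functional paradigm, but I am stuck. Any help is appreciated.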
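
One way the split-on-key idea could be completed, as a sketch only: split the data frame into one piece per key, add the running columns to each piece with cummin() and cummax(), and bind the pieces back together. The data frame name prices and its contents are made up for illustration.

```r
# Illustrative data, deliberately not in sorted order
prices <- data.frame(
  Key = c("AAA", "BBB", "AAA", "AAA"),
  DaysToEvent = c(120, 100, 110, 100),
  PriceEUR = c(50, 50, 40, 60)
)

# Sort by key, then by descending DaysToEvent, as in the question
prices <- prices[order(prices$Key, -prices$DaysToEvent), ]

# Split: one data frame per key
pieces <- split(prices, prices$Key)

# Apply: add the running minimum and maximum within each piece
with_ranges <- lapply(pieces, function(piece) {
  piece$Pmin <- cummin(piece$PriceEUR)
  piece$Pmax <- cummax(piece$PriceEUR)
  piece
})

# Combine: stack the pieces back into one data frame
result_by_key <- do.call(rbind, with_ranges)
rownames(result_by_key) <- NULL
print(result_by_key)
```

The combined result is grouped by key, so it matches the sorted input row for row.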

Answer 1

Mar*_*pov 5

[Original answer: dplyr]

You can solve this with the dplyr package:

library(dplyr)
d %>% 
  group_by(Key) %>% 
  mutate(Pmin=cummin(PriceEUR),Pmax=cummax(PriceEUR))

#   Key DaysToEvent PriceEUR Pmin Pmax
# 1 AAA         120       50   50   50
# 2 AAA         110       40   40   50
# 3 AAA         100       60   40   60
# 4 BBB         100       50   50   50

where d is your data set:

d <- data.frame(
  Key = c('AAA', 'AAA', 'AAA', 'BBB'),
  DaysToEvent = c(120, 110, 100, 100),
  PriceEUR = c(50, 40, 60, 50),
  Pmin = c(50, 40, 40, 30),
  Pmax = c(50, 50, 60, 70)
)

[Update: data.table]

Another option is data.table, which has quite impressive performance:

library(data.table)
DT <- setDT(d)
DT[,c("Pmin","Pmax") := list(cummin(PriceEUR),cummax(PriceEUR)),by=Key]

DT
#    Key DaysToEvent PriceEUR Pmin Pmax
# 1: AAA         120       50   50   50
# 2: AAA         110       40   40   50
# 3: AAA         100       60   40   60
# 4: BBB         100       50   50   50

[Update 2: base R]

If for some reason you want to use only base R, here is another approach:

d$Pmin <- unlist(lapply(split(d$PriceEUR,d$Key),cummin))
d$Pmax <- unlist(lapply(split(d$PriceEUR,d$Key),cummax))
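
One caveat, as far as I can tell: unlist(lapply(split(...))) returns the values grouped by key, so it lines up with the rows only when the data frame is already sorted by Key. A sketch of an alternative using base R's ave(), which returns its result in the original row order; the data frame d2 is made up for illustration:

```r
# Illustrative data with the keys interleaved (not sorted by Key)
d2 <- data.frame(
  Key = c("AAA", "BBB", "AAA", "BBB"),
  DaysToEvent = c(120, 115, 110, 100),
  PriceEUR = c(50, 70, 40, 80)
)

# ave() applies FUN within each Key group and keeps the original row order
d2$Pmin <- ave(d2$PriceEUR, d2$Key, FUN = cummin)
d2$Pmax <- ave(d2$PriceEUR, d2$Key, FUN = cummax)
print(d2)

# For comparison: the split/unlist result comes back grouped by key,
# so on this unsorted data it does not line up with the rows
grouped_min <- unname(unlist(lapply(split(d2$PriceEUR, d2$Key), cummin)))
print(grouped_min)
```

On data sorted by Key, as in the question, both approaches give the same columns.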
