Select the top N values by group

Ant*_*ico 38 aggregate r

This is in response to a question asked on the r-help mailing list.

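There are plenty of examples of how to find top values by group using SQL, so I imagine it's easy to carry that knowledge over using the R sqldf package.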

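An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it would be nice to see some different ways of treating ties.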

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

How can I find the top or bottom (maximum or minimum) N records per group?

Aru*_*run 40

This seems more straightforward with data.table, as it performs the sort while setting the key.

So, if I were to get the top 3 records in sorted (ascending) order, then,

require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

will do it.

If you want descending order:

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle
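
One thing worth keeping in mind about the snippets above: the key is set on cyl only, so within each group the rows keep their original order rather than being sorted by mpg. A minimal sketch, assuming the goal is the three lowest (or highest) mpg values per cyl, is to order by mpg in `i` before grouping. The column name `model` and the result names are illustrative, not part of the original answer.

```r
library(data.table)

# keep the car names as a regular column (assumed name: "model")
d <- as.data.table(mtcars, keep.rownames = "model")

# order(mpg) sorts the rows first; head() then takes the 3 lowest mpg per cyl
lowest3 <- d[order(mpg), head(.SD, 3), by = cyl]

# order(-mpg) sorts descending, so head() takes the 3 highest mpg per cyl
highest3 <- d[order(-mpg), head(.SD, 3), by = cyl]
```

Ties at the cut-off are dropped arbitrarily here, since head() simply stops after three rows.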

Edit: To sort out ties using the mpg column:

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
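
The `%in% head(sort(unique(mpg)), 3)` idiom above amounts to a dense rank: tied values share a rank and no ranks are skipped. As a sketch of the same idea, data.table's `frank()` with `ties.method = "dense"` can express it directly. The column names `model` and `dense_rank` are illustrative choices.

```r
library(data.table)

d <- as.data.table(mtcars, keep.rownames = "model")

# dense rank of mpg within each cyl: ties share a rank, no gaps between ranks
d[, dense_rank := frank(mpg, ties.method = "dense"), by = cyl]

# keep every row whose mpg is among the 3 lowest distinct values of its group
lowest3_with_ties <- d[dense_rank <= 3][order(cyl, mpg)]
```

For the highest values instead, ranking `-mpg` should work the same way.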


Ist*_*sta 20

Just sort by whatever you need (mpg, for example; the question is not clear on this)

mt <- mtcars[order(mtcars$mpg), ]

Then use the by function to get the top n rows in each group

d <- by(mt, mt["cyl"], head, n=4)

If you want the result to be a data.frame:

Reduce(rbind, d)
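
If a data.frame is wanted from the start, one possible base R alternative is to compute each row's position within its group with `ave()` and subset on that, which avoids the `by()` / `Reduce()` round trip. This is only a sketch; the names `pos` and `lowest4` are illustrative.

```r
# sort by mpg, as in the answer above
mt <- mtcars[order(mtcars$mpg), ]

# position of each row within its cyl group (1 = lowest mpg in that group)
pos <- ave(seq_len(nrow(mt)), mt$cyl, FUN = seq_along)

# keep the first 4 rows of every group; the result is a plain data.frame
lowest4 <- mt[pos <= 4, ]
```

The rows come back in mpg order rather than grouped by cyl, and the original row names are preserved.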

Edit: Handling ties is more difficult, but if all ties are wanted:

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])

Another approach is to break ties based on some other information, e.g.,

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)


Aza*_*hya 12

dplyr does the trick

library(dplyr)

mtcars %>% 
  arrange(desc(mpg)) %>% 
  group_by(cyl) %>% slice(1:2)


 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
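
In dplyr 1.0.0 and later, `slice_max()` and `slice_min()` do the same job without a separate `arrange()` step, and their `with_ties` argument makes the tie handling explicit. A minimal sketch, assuming a recent dplyr; the `model` column and the result names are illustrative.

```r
library(dplyr)

# keep the car names, which the tibble output would otherwise drop
cars <- tibble::rownames_to_column(mtcars, var = "model")

# exactly 2 rows per cyl, ties at the cut-off are dropped
top2 <- cars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = FALSE)

# rows tied with the 2nd highest mpg are kept as well
top2_ties <- cars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = TRUE)
```

With mtcars, `top2_ties` has one extra row because two 6-cylinder cars share the second-highest mpg of 21.0.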


clo*_*tes 9

There are at least 4 ways to do this; however, each one behaves a little differently. We group by u_id and sort by the lift value.

1 The traditional dplyr way

library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)

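If you switch the order of arrange(desc(lift)) and group_by(u_id), the result is essentially the same. If there are tied lift values, it slices so that each group has no more than 10 rows; if a group has only 5 lift values, you only get 5 results for that group.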

2 The dplyr top_n way

library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)

With this one, if there are ties in the lift value, say 15 identical lift values for the same u_id, you will get all 15 observations.

3 The data.table tail way

library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[,tail(.SD,10),by = c("u_id")]

It returns the same number of rows as the first way; however, some of the rows are different. I guess the two methods break ties differently.

4 The data.table .SD way

library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]

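This way is the most "uniform": if a group has only 5 observations, the result is still padded to 10 rows for that group (the extra rows are filled with NA), and if there are ties it still slices and keeps only 10 observations.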
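
To make the padding behaviour of the fourth way concrete, here is a small sketch on made-up data. The toy `final_subset` below is hypothetical and only stands in for the real table, and `n` is 3 rather than 10 to keep it short. Indexing with `[1:n]` past the end of a small group produces NA rows, whereas `head()` stops at the end of the group.

```r
library(data.table)

# hypothetical toy data standing in for final_subset
final_subset <- data.table(
  u_id = c("a", "a", "a", "a", "b", "b"),
  lift = c(3, 1, 2, 2, 5, 4)
)
n <- 3

# [1:n] always returns n rows per group; group "b" has only 2, so 1 NA row is added
padded <- final_subset[, .SD[order(-lift)][1:n], by = u_id]

# head() returns at most n rows per group and never pads
trimmed <- final_subset[, head(.SD[order(-lift)], n), by = u_id]
```

If the NA rows are not wanted, the `head()` form is probably the safer default.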


Ant*_*ico 1

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# work on a copy of the data frame
x <- mtcars

# order it based on the group by variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties.  using `max` would *exclude* all ties
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
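
The same rank-based logic can be condensed with `ave()`, which returns the within-group ranks already aligned with the rows, so no sorting or `rev()` is needed. The helper below is a hypothetical sketch, not part of the original answer; its name and arguments are made up for illustration.

```r
# hypothetical helper: keep rows whose within-group rank of `value` is <= n
top_n_by_group <- function(df, value, group, n = 3, largest = FALSE,
                           ties.method = "min") {
  # negate the values to rank from the largest down
  v <- if (largest) -df[[value]] else df[[value]]
  # rank within each group; ave() keeps the result in the row order of df
  ranks <- ave(v, df[[group]],
               FUN = function(z) rank(z, ties.method = ties.method))
  df[ranks <= n, ]
}

# 'min' includes ties at the cut-off, 'max' excludes them
smallest_incl_ties <- top_n_by_group(mtcars, "mpg", "cyl", n = 3, ties.method = "min")
smallest_excl_ties <- top_n_by_group(mtcars, "mpg", "cyl", n = 3, ties.method = "max")
```

Passing `largest = TRUE` should give the highest values per group under the same tie rules.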

  • OK, point taken. Sorry about the downvote. I don't think there's an undo button... (3 upvotes)
  • This is way too complicated for such a simple task! (2 upvotes)