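This is in response to a question asked on the r-help mailing list.
There are plenty of examples of how to find top values by group using SQL, so I imagine that knowledge is easy to carry over using the R sqldf package.
An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it would be nice to show some different ways of treating ties.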
mpg cyl disp hp drat wt qsec vs am gear carb ranks
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 2.0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1.0
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.0
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.0
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1.0
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1.5
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1.5
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3.0
How can I find the top or bottom (maximum or minimum) N records for each group?
Aru*_*run 40
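This seems more straightforward with data.table, since it performs the sort while setting the key.
So, if I want the top 3 records in sorted (ascending) order, then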
require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]
does the job.
If you want descending order:
d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle
Edit: To handle ties using the mpg column:
d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]
# cyl mpg disp hp drat wt qsec vs am gear carb rank
# 1: 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 11
# 2: 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2 1
# 3: 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1 8
# 4: 4 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
# 5: 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1 7
# 6: 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4 1
# 7: 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4 2
# 8: 8 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4 7
# 9: 8 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4 14
# 10: 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4 5
# 11: 8 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4 3
# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
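Note that keying on cyl alone only sorts by cyl, so head(.SD, 3) returns the first three rows of each group in their existing order rather than the three highest mpg values. A minimal sketch of ordering by mpg inside each group, assuming data.table is installed and that the ranking column is mpg; the model column and the top3 variable are names chosen here for illustration and are not part of the original answer:

```r
library(data.table)

# keep the car names as a regular column (the column name "model" is an assumption)
d <- as.data.table(mtcars, keep.rownames = "model")

# sort by group, then by descending mpg, then keep the first 3 rows of each group
top3 <- d[order(cyl, -mpg), head(.SD, 3), by = cyl]

print(top3[, .(cyl, model, mpg)])
```

Replacing -mpg with mpg gives the three lowest values per group instead. This variant always cuts each group at exactly three rows, so rows tied with the third value are dropped.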
Ist*_*sta 20
Just sort by whatever you like (mpg, for example; the question is not clear on this):
mt <- mtcars[order(mtcars$mpg), ]
Then use the by function to get the top n rows in each group:
d <- by(mt, mt["cyl"], head, n=4)
If you want the result to be a data.frame:
Reduce(rbind, d)
Edit: Handling ties is more difficult, but if all ties are desired:
by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])
Another approach is to break ties based on some other information, e.g.,
mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)
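The same sort-then-take idea can also be written without by(), which keeps the result as a single data.frame throughout. A sketch in base R, assuming the ranking column is mpg and the largest values are wanted; pos and top3 are names introduced here for illustration:

```r
# sort by group, then by descending mpg
mt <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

# position of each row within its cyl group (1 = highest mpg in that group)
pos <- ave(seq_len(nrow(mt)), mt$cyl, FUN = seq_along)

# keep the first 3 positions of every group
top3 <- mt[pos <= 3, ]
print(top3[, c("mpg", "cyl")])
```

Because the cut is made on row position rather than on the mpg value, each group yields at most three rows and ties at the boundary are dropped.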
Aza*_*hya 12
dplyr does the trick:
mtcars %>%
arrange(desc(mpg)) %>%
group_by(cyl) %>% slice(1:2)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
6 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
There are at least 4 ways to do this, and each one behaves slightly differently. Here we group by u_id and use the lift value to sort.
1 The traditional dplyr way
library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)
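If you swap the order of arrange(desc(lift)) and group_by(u_id), the result is essentially the same. If there are tied lift values, it slices so that each group has no more than 10 rows; if a group has only 5 lift values, you get only 5 results for that group.
2 The dplyr top_n way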
library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)
With this one, if there are ties in the lift value, say 15 identical lift values for the same u_id, you will get all 15 observations.
3 The data.table tail way
library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[, tail(.SD, 10), by = "u_id"]
It returns the same number of rows as the first way, but some of the rows differ. My guess is that the two approaches break ties differently.
4 The data.table .SD way
library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]
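This way is the most "uniform": if a group has only 5 observations, indexing with [1:10] pads the result with NA rows so that the group still has 10 rows, and if there are ties it still slices and keeps only 10 observations.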
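The tie-handling differences described above can be made explicit with slice_max(), which supersedes top_n() in dplyr 1.0.0 and later. A small sketch on mtcars rather than on final_subset (whose contents are not shown here), assuming dplyr >= 1.0.0 is installed; no_ties and with_ties are illustrative names:

```r
library(dplyr)

# exactly 3 rows per group: rows tied at the cut-off are dropped
no_ties <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 3, with_ties = FALSE) %>%
  ungroup()

# every row tied with the 3rd-largest value is kept, so a group can exceed 3 rows
with_ties <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 3, with_ties = TRUE) %>%
  ungroup()

# rows per group under each policy
print(table(no_ties$cyl))
print(table(with_ties$cyl))
```

In mtcars the cyl == 4 group has two cars at 30.4 mpg in third place, so the with_ties = TRUE result holds one more row than the with_ties = FALSE result. slice_min() works the same way for the smallest values.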
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# work on a copy of the data frame
x <- mtcars
# order it by the group-by variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
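The rank-based approach above can be wrapped in a small helper so that the tie policy becomes a parameter. This is only a sketch under stated assumptions: top_n_per_group and its argument names are hypothetical and introduced here for illustration, and the function relies only on ave() and rank() from base R.

```r
# Hypothetical helper: keep rows whose within-group rank is <= n.
#   df          a data frame
#   group       name of the grouping column (a single column)
#   value       name of the numeric column to rank
#   ties.method passed to rank(); "min" keeps all boundary ties,
#               "max" drops them, "first" keeps exactly n rows per group
#   largest     TRUE ranks from the largest value, FALSE from the smallest
top_n_per_group <- function(df, group, value, n = 3,
                            ties.method = "min", largest = FALSE) {
  v <- if (largest) -df[[value]] else df[[value]]
  # rank the values separately inside each group
  r <- ave(v, df[[group]], FUN = function(z) rank(z, ties.method = ties.method))
  out <- df[r <= n, ]
  # return the rows sorted by group, then by value
  out[order(out[[group]], out[[value]]), ]
}

res_min   <- top_n_per_group(mtcars, "cyl", "mpg", n = 3, ties.method = "min")
res_max   <- top_n_per_group(mtcars, "cyl", "mpg", n = 3, ties.method = "max")
res_first <- top_n_per_group(mtcars, "cyl", "mpg", n = 3, ties.method = "first")

print(res_min[, c("mpg", "cyl")])
```

With the three smallest mpg values per cyl, "min" keeps both cars tied for third place in the cyl == 4 group, "max" drops both, and "first" keeps only the one that appears first in the data.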