sta*_*ant 8 r quantile data.table
I have a data.table and would like to compute statistics by group.
R) library(data.table)
R) set.seed(1)
R) DT=data.table(a=rnorm(100),b=rnorm(100))
The groups should be defined by the deciles of a:
R) quantile(DT$a,probs=seq(.1,.9,.1))
10% 20% 30% 40% 50% 60% 70% 80% 90%
-1.05265747329 -0.61386923071 -0.37534201964 -0.07670312896 0.11390916079 0.37707993057 0.58121734252 0.77125359976 1.18106507751
How do I compute, say, the mean of b within each bin? For example, a value of a = -.5 lies within [-0.61386923071,-0.37534201964], so it belongs to bin 3.
How about:
> DT[, mean(b), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1
1: NA -0.31359818
2: (-1.05,-0.614] -0.14103182
3: (-0.614,-0.375] -0.33474492
4: (-0.375,-0.0767] 0.20827735
5: (-0.0767,0.114] 0.14890251
6: (0.114,0.377] 0.16685304
7: (0.377,0.581] 0.07086979
8: (0.581,0.771] 0.17950572
9: (0.771,1.18] -0.04951607
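To confirm which bin a single value lands in, such as the -0.5 from the question, something along these lines may help. It is only a sketch in base R: it assumes the same seed as the question, and the variable names `q` and `bin` are made up for illustration.

```r
set.seed(1)
a <- rnorm(100)  # the same draws as DT$a in the question
q <- quantile(a, probs = seq(.1, .9, .1))

# findInterval() counts how many breaks lie at or below the value;
# adding 1 numbers the bins starting from the open-ended bottom bin.
bin <- findInterval(-0.5, q) + 1L
bin
```

Note that findInterval() uses left-closed intervals by default, whereas cut() uses right-closed ones; passing left.open = TRUE makes the two agree for values that sit exactly on a break.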
To see what ends up in the NA group (and to check the results), I next did:
> DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1 N V3
1: NA -0.31359818 20 1.59528080213779,1.51178116845085,-2.2146998871775,-1.98935169586337,-1.47075238389927,1.35867955152904,
2: (-1.05,-0.614] -0.14103182 10 -0.626453810742332,-0.835628612410047,-0.820468384118015,-0.621240580541804,-0.68875569454952,-0.70749515696212,
3: (-0.614,-0.375] -0.33474492 10 -0.47815005510862,-0.41499456329968,-0.394289953710349,-0.612026393250771,-0.443291873218433,-0.589520946188072,
4: (-0.375,-0.0767] 0.20827735 10 -0.305388387156356,-0.155795506705329,-0.102787727342996,-0.164523596253587,-0.253361680136508,-0.112346212150228,
5: (-0.0767,0.114] 0.14890251 10 -0.0449336090152309,-0.0161902630989461,0.0745649833651906,-0.0561287395290008,-0.0538050405829051,-0.0593133967111857,
6: (0.114,0.377] 0.16685304 10 0.183643324222082,0.329507771815361,0.36458196213683,0.341119691424425,0.188792299514343,0.153253338211898,
7: (0.377,0.581] 0.07086979 10 0.487429052428485,0.575781351653492,0.389843236411431,0.417941560199702,0.387671611559369,0.556663198673657,
8: (0.581,0.771] 0.17950572 10 0.738324705129217,0.593901321217509,0.61982574789471,0.763175748457544,0.696963375404737,0.768532924515416,
9: (0.771,1.18] -0.04951607 10 1.12493091814311,0.943836210685299,0.821221195098089,0.918977371608218,0.782136300731067,1.10002537198388,
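Aside: I've returned a list column here (each cell is itself a vector) to get a quick look at the values going into each bin, just as a check. data.table shows commas when printing (and only the first 6 items of each cell), but each cell of V3 is actually a numeric vector.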
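One way to convince yourself that the cells really are full vectors is to compare their lengths with the group counts. This is a rough sketch; the column names `mean_b`, `vals` and `bin` and the object name `res` are chosen here for readability and are not part of the original code.

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rnorm(100), b = rnorm(100))

# Same grouping as above, keeping the raw `a` values of each bin in a list column.
res <- DT[, list(mean_b = mean(b), N = .N, vals = list(a)),
          keyby = list(bin = cut(a, quantile(a, probs = seq(.1, .9, .1))))]

# Each cell holds every value of its group, not just the 6 that are printed.
str(res$vals[[1]])
stopifnot(all(lengths(res$vals) == res$N))
```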
So the values outside the first and last breaks are being coded as NA. It wasn't obvious to me how to tell cut not to do that, so I just added -Inf and +Inf:
> DT[,list(mean(b),.N),keyby=cut(a,c(-Inf,quantile(a,probs=seq(.1,.9,.1)),+Inf))]
cut V1 N
1: (-Inf,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18, Inf] -0.45781268 10
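The effect of padding the breaks is easier to see on a tiny vector. The following is a minimal illustration with made-up values (`x`, `inner` and `padded` are hypothetical names), not part of the original answer:

```r
x <- c(1, 2, 3, 4, 5)

# Only the interval (2,4] exists, so anything at or below 2, or above 4, becomes NA.
inner <- cut(x, c(2, 4))

# Padding with -Inf and Inf adds open-ended bins that catch the tails.
padded <- cut(x, c(-Inf, 2, 4, Inf))

data.frame(x, inner, padded)
```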
That's better. Or alternatively:
> DT[, list(mean(b),.N), keyby=cut(a,quantile(a,probs=seq(0,1,.1)),include=TRUE)]
cut V1 N
1: [-2.21,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18,2.4] -0.45781268 10
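That way you see the min and max rather than -Inf and +Inf. Note that you need to pass include=TRUE to cut (it partially matches the include.lowest argument); otherwise 11 groups are returned, the first of which is an NA group containing just 1 row: the minimum of a, which the half-open first interval excludes.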
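A small check of what include.lowest changes, again on made-up values rather than the data above (the names `x`, `breaks`, `without` and `with_lowest` are illustrative only):

```r
x <- c(1, 2, 3, 4, 5)
breaks <- c(1, 3, 5)  # the lowest break equals min(x)

# By default the first interval is (1,3], so the minimum itself is dropped to NA.
without <- cut(x, breaks)

# include.lowest = TRUE closes the first interval on the left: [1,3].
with_lowest <- cut(x, breaks, include.lowest = TRUE)

data.frame(x, without, with_lowest)

# The abbreviated spelling used above relies on R's partial argument matching.
stopifnot(identical(cut(x, breaks, include = TRUE), with_lowest))
```

Spelling the argument out in full as include.lowest is arguably the safer habit, since partial matching can become ambiguous if a function gains another argument with the same prefix.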