Split rows into 10 groups, each with the same total value

usc*_*t01 4 r data.table

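I have data with two columns, ID and Revenue. I want to create a column that splits the data into 10 groups, each accounting for 10% of total revenue. The quantile approach gives me 10 groups with the same number of IDs rather than the same revenue.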

idrev[ , decile := cut(Revenue,
                    breaks = quantile(Revenue, probs = seq(0, 1, by = 1/10)),
                    labels = 1:10, right = FALSE,
                    include.lowest = TRUE)]  # otherwise the maximum Revenue gets NA

I get results like the following:

     N  Revenue  %Revenue
   100   $3,992       80%
   100     $518       10%
   100     $236        5%
   100     $126        3%
   100      $68        1%
   100      $35        1%
   100      $16        0%
   100       $6        0%
   100       $2        0%
   100       $1        0%
 1,000   $5,000      100%

This is the result I am looking for:

     N  Revenue  %Revenue
   798     $500       10%
   104     $500       10%
    47     $500       10%
    25     $500       10%
    14     $500       10%
     7     $500       10%
     3     $500       10%
     2     $500       10%
     1     $500       10%
     1     $500       10%
 1,000   $5,000      100%

Please suggest a solution for this in R.

Adding code to generate sample data and summary statistics:

library(Hmisc)       # provides cut2()
library(data.table)
set.seed(123)
idrev <- data.table(ID = 1:1000, Revenue = sample(100, 1000, replace = TRUE))
idrev[, .(.N, sum(Revenue))]  # check total revenue
idrev[, decile := cut2(Revenue, g = 10)]  # equal-count groups, not equal-revenue
idrev[, .(.N, sum(Revenue)), by = decile][order(decile)]

lmo*_*lmo 5

Here is a data.table-only method that will get you there:

idrev[order(Revenue), revDec := 10 * ceiling(10 * (cumsum(Revenue) / sum(Revenue)))]

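This computes the decile directly after ordering the rows by Revenue: each row's cumulative share of total revenue is scaled by 10 and rounded up, so shares in (0, 0.1] fall in the first group, (0.1, 0.2] in the second, and so on. The outer multiplication by 10 only relabels the groups as 10, 20, ..., 100.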

Here is the result of summing Revenue by revDec:

idrev[, .(Revenue=sum(Revenue)), by="revDec"]
    revDec Revenue
 1:     10    5004
 2:     70    5070
 3:     20    5039
 4:     80    5025
 5:     90    4974
 6:     30    4974
 7:     40    5059
 8:     50    5026
 9:    100    5091
10:     60    4960

They are all very close to 5,000, which is roughly one tenth of the total revenue.
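
To make the cumulative-share idea easier to check on its own, here is a minimal base R sketch of the same calculation. The helper name `revenue_group` and its `g` argument are my own illustration, not part of data.table or of the code above, and it labels the groups 1 to 10 instead of 10 to 100. It assumes non-negative values with a positive total.

```r
# Hypothetical helper (illustration only): assign each element of `x` to one of
# `g` groups so that every group holds roughly 1/g of sum(x).
# Assumes x is non-negative and sum(x) > 0.
revenue_group <- function(x, g = 10) {
  ord   <- order(x)                    # smallest values first
  share <- cumsum(x[ord]) / sum(x)     # cumulative share of the total, ends at 1
  grp   <- integer(length(x))
  # ceiling() maps shares in (0, 1/g] to 1, (1/g, 2/g] to 2, and so on;
  # pmax/pmin keep zero-valued rows and any rounding noise inside 1..g
  grp[ord] <- as.integer(pmin(pmax(ceiling(g * share), 1), g))
  grp
}

# Same sample data as in the question
set.seed(123)
revenue <- sample(100, 1000, replace = TRUE)
grp <- revenue_group(revenue)

# Row count and revenue per group
grp_f <- factor(grp, levels = 1:10)
summary_tbl <- data.frame(
  group   = 1:10,
  N       = as.vector(table(grp_f)),
  Revenue = as.vector(tapply(revenue, grp_f, sum))
)
print(summary_tbl)
```

Because a group boundary can only fall between two rows, each group total can miss the exact tenth by up to the largest single Revenue value. With heavily skewed data, where one ID is worth more than 10% of the total, some groups may come out empty.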