Percentile results in R do not match MS Excel

equ*_*ity 1 excel r subset data.table

I have the following toy data set (the actual data set has about 500,000 records):

library(data.table)

dt <- data.table(Address = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", "Copper", "Gold", "Bronze"),
                 Name = c("Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1"), 
                 AvgValue = c(0, 0.5, 1.25, 0.75, 1.5, 0.7, 0.41, 0.83, 2.58),
                 Samples = c(123, 233, 504, 3, 94, 50, 401, 402, 12))

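I would like to do the following:

a) Subset the data so that we only consider "Gold" records whose value column (AvgValue) is greater than zero.

b) Using the filtered data from "a" above, print out the percentiles and other descriptive statistics.

The code that performs "a" and "b" above is as follows: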

qs = dt[AvgValue > 0 & Address %like% 'Gold', 
        .(Samples = sum(Samples),
          '25th'    = quantile(AvgValue, probs = c(0.25)),
          '50th'    = quantile(AvgValue, probs = c(0.50)),
          '75th'    = quantile(AvgValue, probs = c(0.75)),
          '95th'    = quantile(AvgValue, probs = c(0.95)),
          '99th'    = quantile(AvgValue, probs = c(0.99)),
          '99.9th'  = quantile(AvgValue, probs = c(0.999)), 
          '99.99th' = quantile(AvgValue, probs = c(0.9999)),
          'Mean'    = mean(AvgValue),
          'Median'  = median(AvgValue),
          'StdDev'  = sd(AvgValue)),
        by = .(Name, Address)]
setkey(qs, 'Name')

Printing qs shows:

Name    Address Samples 25th  50th   75th   95th   99th    99.9th   99.99th   Mean     Median   StdDev
Stat1   Gold    779     0.65  0.765  0.9975 1.3995 1.4799  1.49799  1.499799  0.8825   0.765    0.4334647

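So far, so good. These values from the (small) toy data set appear to agree with the output of the PERCENTILE() function in MS Excel.

EDIT: The problem is this: when I apply the same R code to the larger data set, the values R outputs do not agree with those from Excel's PERCENTILE() function. At the lower percentiles the values differ slightly; at the higher percentiles they differ significantly. Here are the differences: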

             25th           50th        75th        95th        99th        99.9th      99.99th
    R        0.414442227    0.428557466 0.45030771  1.668065665 42.7787092  146.9633133 349.6416913
    Excel    0.414774203    0.429350073 0.448245768 0.971100779 13.31231723 98.75342572 188.2700879

Here are 20 actual data points (out of 11,283 "Gold" rows in total), in descending order:

AvgValue
349.1436739
190.189758
175.2157327
158.6492516
132.9550737
132.2686941
126.570912
122.9771829
107.6942185
99.98552912
98.93274272
98.75984129
98.73709105
98.30154271
98.2491005
96.97274385
96.94577839
96.9128099
96.90816688
96.82527478

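The values from Excel seem "more correct" (especially at the upper percentiles).

Does anyone see anything obviously wrong with my R code?

If not, any ideas as to why the values in R do not tie out to those in Excel?

Perhaps it is the "type" argument of the quantile() function (which I am not passing)?

Thanks!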

The*_*aFC 7

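I can reproduce Excel's PERCENTILE function by setting type=7 in R's quantile function. See the [[7]] element of the lapply output below and compare it with the results I got from Excel's PERCENTILE on my toy vector testveclog: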

set.seed(12272019)
testveclog <- rlnorm(11283, meanlog=-0.12, sdlog=3)
lapply(1:9, function(x) quantile(testveclog, probs=c(0.95, 0.99, 0.999), type=x))

#[[1]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[2]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[3]]
#      95%       99%     99.9% 
# 131.0835  932.8875 6213.7963 

#[[4]]
#      95%       99%     99.9% 
# 131.0141  933.0096 6198.9585 

#[[5]]
#      95%       99%     99.9% 
# 131.1827  933.3687 6230.8209 

#[[6]]
#      95%       99%     99.9% 
# 131.3103  935.1852 6269.9696 

#[[7]]
#      95%       99%     99.9% 
# 131.0372  933.0168 6199.0109 

#[[8]]
#      95%       99%     99.9% 
# 131.2253  933.4860 6243.8705 

#[[9]]
#      95%       99%     99.9% 
# 131.2146  933.4567 6240.6081

writeClipboard(as.character(testveclog)) #copy and then paste into Excel to compare functions
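As a small supplementary check (a sketch, not part of the original answer): type = 7 is also the documented default of quantile(), so omitting the argument and passing type = 7 should give the same numbers. The vector gold below is just the toy "Gold" rows with AvgValue > 0 from the question, and the variable names are illustrative.

```r
# Toy "Gold" values with AvgValue > 0, copied from the question's data set
gold  <- c(0.5, 1.5, 0.7, 0.83)
probs <- c(0.25, 0.50, 0.75, 0.95)

default_q <- quantile(gold, probs = probs)            # no type argument passed
type7_q   <- quantile(gold, probs = probs, type = 7)  # type 7 requested explicitly

print(default_q)
print(identical(default_q, type7_q))
```

The printed quantiles are the same 25th/50th/75th/95th values shown in the question's qs output for the toy data.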


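Note that in recent versions of Excel the PERCENTILE function is deprecated in favor of PERCENTILE.INC (which behaves like the old PERCENTILE) and PERCENTILE.EXC; the output of PERCENTILE.EXC matches R's quantile with type=6.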
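To make the difference between the two conventions concrete, here is a minimal sketch of the linear interpolation behind them. For n sorted values and probability p, type 7 (the PERCENTILE / PERCENTILE.INC convention) interpolates at position h = (n - 1) * p + 1, while type 6 (the PERCENTILE.EXC convention) uses h = (n + 1) * p. The helper manual_percentile is hypothetical, written only for illustration; it handles a single p and clamps h to the valid index range, as R's quantile does at the extremes.

```r
# Hypothetical helper (not part of base R or Excel): linear-interpolation
# percentile for a single probability p.
manual_percentile <- function(x, p, inclusive = TRUE) {
  x <- sort(x)
  n <- length(x)
  # Fractional position in the sorted data:
  #   inclusive = TRUE  -> type 7: h = (n - 1) * p + 1
  #   inclusive = FALSE -> type 6: h = (n + 1) * p
  h <- if (inclusive) (n - 1) * p + 1 else (n + 1) * p
  h <- min(max(h, 1), n)   # clamp to [1, n]
  lo <- floor(h)
  hi <- min(lo + 1, n)
  # Interpolate between the two neighbouring order statistics
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Toy "Gold" values with AvgValue > 0 from the question
gold <- c(0.5, 1.5, 0.7, 0.83)

inc_75 <- manual_percentile(gold, 0.75, inclusive = TRUE)
exc_75 <- manual_percentile(gold, 0.75, inclusive = FALSE)

print(c(manual = inc_75, r_type7 = unname(quantile(gold, 0.75, type = 7))))
print(c(manual = exc_75, r_type6 = unname(quantile(gold, 0.75, type = 6))))
```

On heavy-tailed data the two positions can land on quite different order statistics near the top of the distribution, which is why the choice of type matters most at the upper percentiles.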