Splitting a continuous variable into equal-sized groups

Question

baz*_*baz 53 variables split r continuous

I need to split a continuous variable into 3 equal-sized groups.

Example data frame:

das <- data.frame(anim=1:15,
                  wt=c(181,179,180.5,201,201.5,245,246.4,
                       189.3,301,354,369,205,199,394,231.3))

After cutting (based on the value of wt), I need 3 classes in a new variable, wt2, as shown below:

> das 
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

This will be applied to a large data set.

Answer 1

koh*_*ske 60

Try this:

split(das, cut(das$anim, 3))

If you want to split based on the value of wt, then:

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

In any case, you can do it by combining cut, cut2, and split.

Update

If you want the group index as an additional column, then:

das$group <- cut(das$anim, 3)

If the column should be indexed as 1, 2, ..., then:

das$group <- as.numeric(cut(das$anim, 3))

Update again

Try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

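I'm confused about why this is the accepted answer when the question explicitly asks for "equal-sized groups" and `cut()` cannot achieve that. (3 upvotes)
You can drop as.numeric and use `cut(das$anim, 3, labels = FALSE)` (2 upvotes)
This should be updated to make it clear that it differs from @Ben's answer below. I used this code by mistake because I assumed it would divide the observations evenly. (2 upvotes)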

Answer 2

Ben*_*ker 37

Or see cut_number from the ggplot2 package, e.g.

das$wt_2 <- as.numeric(cut_number(das$wt,3))

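Note that cut(..., 3) divides the range of the original data into three ranges of equal length; if the data are unevenly distributed, this does not necessarily give the same number of observations per group (you can replicate what cut_number does by using quantile appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2() with the g= argument does split by quantiles, so it is more or less equivalent to ggplot2::cut_number. I would have thought something like cut_number would have made its way into dplyr by now, but as far as I can tell it hasn't.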
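
The remark about replicating cut_number with quantile can be sketched in base R roughly as follows. This is only an illustration, not the ggplot2 implementation: it assumes the das data frame from the question, and the variable name `breaks` is introduced here for the example.

```r
# Example data from the question
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# Break points at the 0, 1/3, 2/3 and 1 quantiles of wt
breaks <- quantile(das$wt, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum in the first group;
# labels = FALSE returns integer group codes instead of a factor
das$wt2 <- cut(das$wt, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(das$wt2)
```

For this data each of the three groups ends up with five rows. With heavily tied data, quantile() can return duplicated break points, in which case cut() stops with a "'breaks' are not unique" error, so this sketch is best suited to values with few ties.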

Answer 3

Ben*_*Ben 7

Here is another solution, using the bin_data() function from the mltools package.

library(mltools)

# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")

# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")

# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")

das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

Answer 4

Mat*_*cho 6

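If you want to split into 3 evenly distributed groups, the answer is the same as Ben Bolker's answer above: use ggplot2::cut_number(). For completeness, here are 3 methods of converting a continuous variable to a categorical one (binning).

cut_number(): makes n groups with (approximately) equal numbers of observations
cut_interval(): makes n groups with equal range
cut_width(): makes groups of a given width

My preference is cut_number(), because it uses evenly spaced quantiles to bin the observations. Here is an example with skewed data.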

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)

Answer 5

ped*_*rio 5

An alternative that does not use cut2.

das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))

Or:

das$wt2 <- as.factor( cut(das$wt,3, labels=F))

As @ben-bolker points out, this splits into equal widths rather than equal occupancy. I think equal occupancy can be approximated using quantiles:

x = rnorm(10)
x
 [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
 [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
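bin = 3 # 3 for thirds, 4 for quarters, 100 for percentiles, etc.
# quantile() takes `probs`, not `breaks`; bin + 1 break points give `bin` groups
xx = cut(x, quantile(x, probs = seq(0, 1, length.out = bin + 1)), labels=F, include.lowest=T)
table(xx)
1 2 3
4 3 3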

I think this splits into equal-width rather than equal-occupancy bins? (7 upvotes)

Answer 6

Dan*_*wer 5

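ntile from dplyr now does this, but it behaves oddly with NA values.

I have used similar code in the function below, which works in base R and is equivalent to the cut2 solution above: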

ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    return(d)
}
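
A short usage sketch for the function above may help. The function body is repeated so the snippet runs on its own; the `wt` vector is the data from the question, and the second input vector is made up here purely to show how a missing value is handled.

```r
# Function from the answer above, repeated so this snippet is self-contained
ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    return(d)
}

# wt values from the question
wt <- c(181, 179, 180.5, 201, 201.5, 245, 246.4,
        189.3, 301, 354, 369, 205, 199, 394, 231.3)
groups <- ntile_(wt, 3)
table(groups)

# Made-up vector: the NA stays NA and is left out when ranks are computed
with_na <- ntile_(c(5, NA, 1, 3, 2, 4, 6), 3)
with_na
```

On the question's data this gives five observations per group, and the second call returns 3 NA 1 2 1 2 3. Because ties are broken by position (ties.method = "first"), identical values can land in different groups, which is the price of exactly balanced group sizes.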

Answer 7

Moo*_*per 5

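cut, when not given explicit break points, divides the values into bins of equal width, which usually will not contain the same number of items: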

x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5]    (5.5,10] 
#           4           1

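Hmisc::cut2 and ggplot2::cut_number use quantiles, which usually creates groups of equal size (in terms of number of elements) if the data are well distributed and of reasonable size, but this is not always the case. mltools::bin_data can give different results but is also based on quantiles.

These functions do not always give neat results when the data contain a small number of distinct values: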

x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                   23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

table(x)
# x
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 15  7 10  3  9  3  4  9  3  2 23  2  4  1  1  7 18 37  6  2   

table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20] 
#      44      44      70       8

table(ggplot2::cut_number(x, 4))
# [1,5]  (5,11] (11,18] (18,20] 
#    44      44      70       8

table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5)  [5, 11) [11, 18) [18, 20] 
#     35       30       56       45

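It is not clear that an optimal solution has been found here.

Which binning method is best is a subjective question, but one reasonable approach is to look for the bins that minimize the variance around the expected bin size.

The smart_cut function from (my) package cutr offers such a feature. It is computationally heavy, though, and should be reserved for cases where cut points and unique values are few (which is usually when it matters).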

# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6)  [6,12) [12,18) [18,20] 
# 44      44      33      45

We see that the groups are much better balanced.

"balanced" in the call can in fact be replaced by a custom function to optimize or restrict the bins as needed, if the variance-based method is not sufficient.
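
The variance-minimizing idea can be illustrated with a brute-force search in base R. This is a sketch of the general idea only, not how cutr implements smart_cut; the names `counts`, `cum`, `cands`, `sizes_for` and `best` are introduced here for the example, and the exhaustive search is only practical when there are few distinct values.

```r
# Same data as in the example above
x <- rep(1:20, c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                 23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

counts <- as.vector(table(x))   # observations per distinct value
cum    <- cumsum(counts)        # cumulative counts
k      <- 4                     # number of bins wanted

# Every way to place k - 1 cuts between adjacent distinct values;
# a cut at position i separates the i-th and (i + 1)-th distinct value
cands <- combn(length(counts) - 1, k - 1)

# Bin sizes implied by a given set of cut positions
sizes_for <- function(cuts) diff(c(0, cum[cuts], sum(counts)))

# Keep the candidate whose bin sizes have the smallest variance
vars <- apply(cands, 2, function(cuts) var(sizes_for(cuts)))
best <- cands[, which.min(vars)]

sort(unique(x))[best]   # last value included in each of the first k - 1 bins
sizes_for(best)         # resulting bin sizes
```

On this data the search cuts after the values 5, 11 and 17, giving bin sizes 44, 44, 33 and 45, the same sizes as the smart_cut output shown above.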

Asked: 14 years, 8 months ago
Viewed: 97,899 times
Last active: 6 years, 5 months ago