Sampling in R so that each group has the same number of each sex

Cia*_*ran 3 statistics r

First things first. Here are my data:

lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)

data <- data.frame(lat, long, sex, score)

The data should look like this:

   lat long sex score
1   12  -14   M     2
2   12  -14   M     6
3   58  139   M     3
4   58  139   M     6
5   58  139   F     5
6   58  139   M     4
7   58  139   M     3
8   45  -68   F     2
9   45  -68   M     3
10  45  -68   M     9
11  45  -68   M     9
12  45  -68   F     8
13  45    1   M     6
14  64    1   F     5
15  64    1   M     6
16  64    1   F     7
17  64    1   F     5
18  64    1   F     7
19  64    1   F     5
20  64    1   M     1

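Here is what I am ultimately trying to work out. The variables are latitude, longitude, sex and score. I want the same number of males and females at each location (i.e. among rows that share the same longitude and latitude). For example, the second location (rows 3 to 7) has only one female. That female should be kept, together with one male drawn from the remaining individuals (perhaps by random sampling). Some locations only have information on one sex; for example, the first location (rows 1 and 2) only has data on males. The rows for such a location should be dropped (because there are no females). If all goes to plan, the final data set should look like this: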

   lat2 long2 sex2 score2
1    58   139    F      5
2    58   139    M      4
3    45   -68    F      2
4    45   -68    M      3
5    45   -68    M      9
6    45   -68    F      8
7    64     1    M      6
8    64     1    F      5
9    64     1    F      7
10   64     1    M      1
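The imbalance described above can be made visible by tabulating the number of rows per sex at each location. This is only a quick sketch for inspecting the example data; the names `location` and `counts` are introduced here for illustration.

```r
# Example data from the question
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)

# One label per unique (lat, long) combination
location <- paste(data$lat, data$long)

# Rows are locations, columns are sexes; unequal counts in a row
# mean that location is unbalanced
counts <- table(location, data$sex)
print(counts)
```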

Any help would be appreciated.

Sve*_*ein 5

Here is a solution using lapply:

data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
        # 'split' splits the sequence of row numbers (indices) along the unique
        # combinations of 'lat' and 'long'
        # 'lapply' applies the following function to all sub-sequences
        function(x) {
          # which of the indices are for males:
          male <- which(data[x, "sex"] == "M")
          # which of the indices are for females:
          female <- which(data[x, "sex"] == "F")
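          # number of rows to keep per sex (size of the smaller group):
          n <- min(length(male), length(female))
          # sample positions with sample.int, because sample(k, ...) with a
          # single number k would draw from 1:k instead of returning k:
          s_male <- male[sample.int(length(male), n)]
          s_female <- female[sample.int(length(female), n)]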
          # combine both sampled indices:
          x[c(s_male, s_female)]                
        })), ]
# The function 'lapply' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.

Result:

   lat long sex score
9   45  -68   M     3
11  45  -68   M     9
12  45  -68   F     8
8   45  -68   F     2
7   58  139   M     3
5   58  139   F     5
20  64    1   M     1
15  64    1   M     6
19  64    1   F     5
16  64    1   F     7
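Because the rows are drawn at random, the selected rows differ between runs. The sketch below wraps the same split-and-sample idea in a function, fixes a seed for repeatability, and checks that every remaining location is balanced. The function name `balance_sexes` and the seed value are assumptions made for this example, not part of the answer above. It draws positions with `sample.int`, since `sample(k, ...)` with a single number `k` samples from `1:k`, which matters when a location has exactly one male or one female.

```r
# Example data from the question
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)

# Hypothetical helper: keep equally many "M" and "F" rows
# for every unique (lat, long) location in df.
balance_sexes <- function(df) {
  # row numbers grouped by location
  groups <- split(seq_len(nrow(df)), paste(df$lat, df$long))
  keep <- lapply(groups, function(rows) {
    male <- rows[df$sex[rows] == "M"]
    female <- rows[df$sex[rows] == "F"]
    # size of the smaller group; 0 drops the location entirely
    n <- min(length(male), length(female))
    c(male[sample.int(length(male), n)],
      female[sample.int(length(female), n)])
  })
  df[unlist(keep, use.names = FALSE), ]
}

set.seed(1)  # arbitrary seed, only to make the random draw repeatable
balanced <- balance_sexes(data)

# Counts per location and sex in the balanced data
check <- table(paste(balanced$lat, balanced$long), balanced$sex)
print(check)
```

In the printed table, the F and M counts should agree in every row, and locations that had only one sex should no longer appear.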