An efficient way to filter low-frequency data in a data frame in R

MyQ*_*MyQ 4 r

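I have a data.frame with multiple columns and want to filter out low-frequency rows based on a combination of variables. As an example, take a Sex variable (Male/Female) and a High/Low variable (called Age in the code below). My data frame looks like this: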

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to keep only the Sex/Age combinations that occur more than 3 times:

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

In other words, I want to keep the rows (indices) for Female/High, Male/Low, and Male/High.

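Please note that 1) my real data frame has more variables than the example above, 2) I do not want to use any third-party R packages, and 3) I want it to be fast.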

tal*_*lat 7

Here is a simple approach in base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low
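
To make the intermediate steps of this approach easier to follow, here is a minimal sketch that rebuilds the question's data explicitly (typed in from the printed data frame rather than sampled, so it does not depend on the random number generator) and inspects each step. The variable names `keep` and `result` are introduced here for illustration only.

```r
# Rebuild the question's data explicitly (no random sampling), so the
# counts match the table shown in the question.
df <- data.frame(
  index = 1:20,
  Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
          "Female", "Female", "Male", "Female", "Male", "Female", "Female",
          "Male", "Female", "Male", "Male", "Male", "Female"),
  Age = c("High", "High", "High", "High", "High", "High", "High", "High",
          "Low", "Low", "High", "High", "High", "High", "Low", "Low",
          "High", "Low", "Low", "Low")
)

lvls <- interaction(df$Sex, df$Age)  # one combined level per row, e.g. "Female.High"
counts <- table(lvls)                # frequency of each combined level
keep <- names(counts)[counts > 3]    # labels of the combinations that occur more than 3 times
result <- df[lvls %in% keep, ]       # rows whose combination is in that set

print(counts)
print(keep)
```

The only combination dropped here is Female/Low, which occurs 3 times, so rows 9, 16, and 20 are removed.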

If you have more variables, you can store their names in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
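
The vector-based variant can be wrapped in a small function. The sketch below is one possible way to do that; `keep_frequent`, its `min_count` argument, and the `toy` data frame are hypothetical names made up for this example, not part of base R or of the question. Passing `drop = TRUE` to `interaction()` is an optional choice that discards combinations that never occur.

```r
# keep_frequent() is a hypothetical helper name, not a base R function.
# It keeps the rows whose combination of `vars` occurs more than
# `min_count` times.
keep_frequent <- function(data, vars, min_count = 3) {
  lvls <- interaction(data[vars], drop = TRUE)  # combined level per row
  counts <- table(lvls)                         # frequency per combination
  data[lvls %in% names(counts)[counts > min_count], , drop = FALSE]
}

# Toy data with three grouping variables (made up for illustration).
toy <- data.frame(
  id = 1:8,
  Sex = c("F", "F", "F", "M", "M", "F", "M", "F"),
  Age = c("High", "High", "High", "Low", "Low", "High", "High", "High"),
  Smoker = c("No", "No", "No", "No", "Yes", "No", "No", "Yes")
)

out <- keep_frequent(toy, c("Sex", "Age", "Smoker"), min_count = 3)
print(out)
```

In this toy data only the F/High/No combination occurs more than 3 times (ids 1, 2, 3, and 6), so those are the rows returned.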

Here is a second base R approach, using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
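
Because `length` ignores the values it is given, the first argument to `ave` only needs to be a vector with one element per row. The sketch below shows an equivalent form that uses a plain row counter instead and stores the per-row group size so it can be inspected; the `toy` data frame and the names `group_size` and `kept` are assumptions made for this example.

```r
# Toy data (made up for illustration).
toy <- data.frame(
  id = 1:7,
  Sex = c("F", "F", "M", "F", "M", "F", "M"),
  Age = c("High", "High", "Low", "High", "Low", "High", "High")
)

# ave() applies FUN within each Sex/Age group and returns a vector
# aligned with the rows, so every row gets the size of its own group.
group_size <- ave(seq_len(nrow(toy)), toy$Sex, toy$Age, FUN = length)

# Keep the rows whose group occurs more than 3 times.
kept <- toy[group_size > 3, ]

print(cbind(toy, group_size))
print(kept)
```

Here F/High occurs 4 times, M/Low twice, and M/High once, so only the four F/High rows are kept.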