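I have a data.frame with several columns and want to filter out low-frequency rows based on combinations of variables. As an example, take a sex variable (Male/Female) and a cholesterol variable (High/Low), which is called Age in the code below. My data frame looks like this: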
set.seed(123)
Sex = sample(c('Male', 'Female'), size = 20, replace = TRUE)
Age = sample(c('Low', 'High'), size = 20, replace = TRUE)
Index = 1:20
df = data.frame(index = Index, Sex = Sex, Age = Age)
df
   index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low
Now I want to keep only the Sex/Age combinations that occur more than 3 times:
table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4
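In other words, I want to keep the indices of the Female/High, Male/Low and Male/High rows.
Please note that 1) my real data frame has several variables (unlike the example above), 2) I don't want to use any third-party R package, and 3) I want it to be fast.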
Here is a simple approach in base R:
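# One factor level per Sex/Age combination, e.g. "Male.High"
lvls <- interaction(df$Sex, df$Age)
# Number of rows in each combination
counts <- table(lvls)
# Keep the rows whose combination occurs more than 3 times
df[lvls %in% names(counts)[counts > 3], ]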
#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low
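To make the intermediate objects easier to follow, here is a minimal sketch of the same steps on a small hand-written data frame. The data frame `toy`, the helper name `keep_lvls` and the threshold of 2 are assumptions made up for illustration, not part of the question; fixed values are used so the result does not depend on the random number generator.

```r
# Hypothetical toy data with fixed values (no sampling involved)
toy <- data.frame(
  index = 1:7,
  Sex   = c("Male", "Male", "Female", "Female", "Female", "Male", "Female"),
  Age   = c("High", "High", "Low", "High", "Low", "High", "Low")
)

# One factor level per Sex/Age combination, e.g. "Male.High"
lvls <- interaction(toy$Sex, toy$Age)

# Number of rows per combination; combinations that never occur get a count of 0
counts <- table(lvls)

# Names of the combinations that occur more than 2 times (assumed threshold)
keep_lvls <- names(counts)[counts > 2]

# Rows whose combination is among the frequent ones
kept <- toy[lvls %in% keep_lvls, ]
kept
```

Here Male/High and Female/Low each occur three times, so only the single Female/High row (index 4) is dropped.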
If you have more variables, you can store their names in a vector:
vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
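If this filter is needed in several places, the same idea could be wrapped in a small function. The sketch below is one possible way to do it; the function name `filter_by_combo_count`, its arguments, and the data frame `example_df` are hypothetical. `example_df` is built by hand to have the same Sex/Age counts as the table in the question (8, 3, 5, 4), not the same row order.

```r
# Hypothetical helper: keep rows whose combination of `vars`
# occurs more than `min_count` times
filter_by_combo_count <- function(data, vars, min_count = 3) {
  # drop = TRUE discards combinations that never occur in the data
  lvls <- interaction(data[vars], drop = TRUE)
  counts <- table(lvls)
  data[lvls %in% names(counts)[counts > min_count], , drop = FALSE]
}

# Hand-built data with the same combination counts as in the question
example_df <- data.frame(
  index = 1:20,
  Sex   = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age   = rep(c("High", "Low", "High", "Low"), times = c(8, 3, 5, 4))
)

result <- filter_by_combo_count(example_df, c("Sex", "Age"), min_count = 3)
table(result$Sex, result$Age)
```

Passing the column names as a character vector keeps the call the same however many grouping variables there are.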
Here is a second base R approach, using ave:
# ave() returns, for every row, the size of that row's Sex/Age group
subset(df, ave(seq_along(Sex), Sex, Age, FUN = length) > 3)
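As a quick sanity check, the two approaches can be compared on the same data. This is only a sketch: the data frame `d` is hand-built to match the combination counts in the question, and the names `group_size`, `via_ave` and `via_table` are made up for illustration.

```r
# Hand-built data with the same combination counts as in the question
d <- data.frame(
  index = 1:20,
  Sex   = rep(c("Female", "Female", "Male", "Male"), times = c(8, 3, 5, 4)),
  Age   = rep(c("High", "Low", "High", "Low"), times = c(8, 3, 5, 4))
)

# ave() approach: one group size per row, then filter on it
group_size <- ave(seq_len(nrow(d)), d$Sex, d$Age, FUN = length)
via_ave <- d[group_size > 3, ]

# interaction()/table() approach
lvls <- interaction(d$Sex, d$Age)
counts <- table(lvls)
via_table <- d[lvls %in% names(counts)[counts > 3], ]

# Both filters should select the same rows
identical(via_ave, via_table)
```

Which of the two is faster on a large data frame is best measured on your own data, for example with system.time().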