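I am trying to calculate the number of students taking a course as a share of those who could have taken it. Not all schools offer Computing, and a different set of schools offers English, so the pool of students who could take Computing differs from the pool who could take English. For example, using the test data below, we have: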
df <- read.csv(text="school, student, course, result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E")
[1] "comp taken by 58.3333333333333 % of possible students"
[1] "Eng taken by 33.3333333333333 % of possible students"
[1] "ICT taken by 38.4615384615385 % of possible students"
I have the following loop (boo!) to do this:
library(magrittr)
library(dplyr)
for(c in unique(df$course)){
  # c <- "comp"
  # get URNs of schools offering each course
  URNs <- df %>% filter(course == c) %>% distinct(school) %$% school
  # get number of students in each school offering course c
  num_possible <- df %>% filter(school %in% URNs) %>% summarise(n = n()) %$% n
  # get number of students taking course c
  num_actual <- df %>% filter(course == c) %>% summarise(n = n()) %$% n
  # get % of students taking course from those who could theoretically take c
  print(paste(c, "taken by", (100 * num_actual/num_possible), "% of possible students"))
}
But I would like to vectorise all of this. The problem is that I can't get num_possible into the same call as num_actual:
df %>% group_by(course) %>% summarise(num_possible = somesubfunction(),
num_actual = n())
somesubfunction() should return the number of students who could have taken course c.
If you are keen to try something other than dplyr, you can use data.table:
library(data.table)
setDT(df)[, nb_stu := .N, by = course] # how many students per course (counted as rows)
df[, nb_stu_ec := length(unique(student)), by = school] # how many distinct students per school, so a student taking several courses is counted once
# finally, compute the number of students on a course
# divided by the number of students in the schools that offer this course (sprintf is only for formatting the result):
df[, sprintf("%.2f", 100 * first(nb_stu) / sum(nb_stu_ec[!duplicated(school)])), by = course]
# course V1
#1: comp 87.50
#2: Eng 60.00
#3: ICT 62.50
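Note that these percentages differ from the ones printed by the loop in the question (58.33, 33.33 and 38.46), because the denominator here counts distinct students per school, whereas the loop counts rows. A minimal base R sketch of the two denominators for the comp course, rebuilding the same test data; the variable names (offering, in_scope and so on) are only illustrative:

```r
# Same test data as in the question
df <- read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E")

# Schools that offer comp
offering <- unique(df$school[df$course == "comp"])

# All rows belonging to those schools
in_scope <- df[df$school %in% offering, ]

# Denominator used by the loop: every row in the offering schools
rows_possible <- nrow(in_scope)

# Denominator used by the data.table code: distinct students in those schools
students_possible <- nrow(unique(in_scope[, c("school", "student")]))

# Numerator in both cases: rows for the course itself
num_actual <- sum(df$course == "comp")

pct_rows     <- 100 * num_actual / rows_possible
pct_students <- 100 * num_actual / students_possible
print(c(rows = pct_rows, students = pct_students))
```

With this data, comp has 7 rows, the offering schools contain 12 rows and 8 distinct students, so the two conventions give 58.33% and 87.50% respectively.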
Nota Bene: if the number of students per course is only computed in the final step, the same result can be obtained in one step fewer:
setDT(df)[, nb_stu_ec:=length(unique(student)), by=school]
df[, sprintf("%.2f", 100*(.N)/sum(nb_stu_ec[!duplicated(school)])), by=course]
# course V1
#1: comp 87.50
#2: Eng 60.00
#3: ICT 62.50
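For completeness, a dplyr sketch in the shape the question asks for; it is not part of the data.table approach above. It assumes df is the original, ungrouped data frame and is still visible by name, so that df$school inside summarise() refers to every row while the bare school refers only to the rows of the current course. Like the loop in the question, it counts rows:

```r
library(dplyr)

# Same test data as in the question
df <- read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E")

res <- df %>%
  group_by(course) %>%
  summarise(
    # rows of the full data whose school offers this course
    num_possible = sum(df$school %in% school),
    # rows for this course
    num_actual = n()
  ) %>%
  mutate(pct = 100 * num_actual / num_possible)

print(res)
```

To count distinct students instead, as the data.table code does, num_possible could be replaced by n_distinct(df$student[df$school %in% school]), assuming student IDs are unique across schools (which holds for this test data).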