Vectorising a complex dplyr statement in R

Question

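I'm trying to work out the share of students who take a course, out of those who were able to take it. Not all schools offer computing, and a different set of schools offers English, so the students able to take computing and those able to take English will differ. For example, using the test data below, we have: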

df <- read.csv(text="school, student, course, result
      URN1,stu1,comp,A
      URN1,stu2,comp,B
      URN1,stu3,comp,C
      URN1,stu1,Eng,D
      URN1,stu1,ICT,E
      URN2,stu4,comp,A
      URN1,stu1,ICT,B
      URN2,stu5,comp,C
      URN3,stu6,comp,D
      URN3,stu6,ICT,E
      URN4,stu7,Eng,E
      URN4,stu8,ICT,E
      URN4,stu8,Eng,E
      URN5,stu9,comp,E
      URN5,stu10,ICT,E", strip.white = TRUE)

[1] "comp taken by 58.3333333333333 % of possible students"
[1] "Eng taken by 33.3333333333333 % of possible students"
[1] "ICT taken by 38.4615384615385 % of possible students"

I have the following loop (boo!) to do this:

library(magrittr)
library(dplyr)

for(c in unique(df$course)){
  # c <- "comp"
  #get URNs of schools offering each course
  URNs <- df %>% filter(course == c) %>% distinct(school) %$% school
  #get number of students in each school offering course c
  num_possible <- df %>% filter(school %in% URNs) %>% summarise(n = n()) %$% n
  #get number of students taking course c 
  num_actual <- df %>% filter(course == c) %>% summarise(n = n()) %$% n

  # get % of students taking course from those who could theoretically take c
  print(paste(c, "taken by", (100 * num_actual/num_possible), "% of possible students"))
}

I would like to vectorise all of this, but I can't work out how to compute num_possible in the same call as num_actual:

df %>% group_by(course) %>% summarise(num_possible = somesubfunction(),
                                      num_actual = n())

somesubfunction() should return the number of students who could possibly have taken course c.

Answer 1

Cat*_*ath 5

If you're keen to try something other than dplyr, you could use data.table:

library(data.table)

setDT(df)[, nb_stu := .N, by = course] # number of rows (student entries) per course
df[, nb_stu_ec := length(unique(student)), by = school] # distinct students per school (edited so that students taking several courses are not counted twice)

# Finally, divide the number of entries for a course by the number of students
# in the schools that offer this course (sprintf is only for formatting the result):
df[, sprintf("%.2f", 100 * first(nb_stu) / sum(nb_stu_ec[!duplicated(school)])), by = course]
#   course    V1
#1:   comp 87.50
#2:    Eng 60.00
#3:    ICT 62.50
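
Note that these figures differ from the 58.33 / 33.33 / 38.46 in the question. The loop there divides by the number of rows in the schools offering a course, while the code above divides by the number of distinct students in those schools. A minimal base R check for comp, assuming the test data from the question (re-created here with strip.white = TRUE so the block runs on its own; the variable names are only illustrative), might look like this:

```r
# Test data from the question, re-created so this block is self-contained
df <- read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E", strip.white = TRUE)

# Schools with at least one comp entry, and all rows from those schools
offers_comp <- unique(df$school[df$course == "comp"])
in_scope <- df[df$school %in% offers_comp, ]

n_comp <- sum(df$course == "comp")                    # comp entries: 7
rows_possible <- nrow(in_scope)                       # rows in those schools: 12
students_possible <- nrow(unique(in_scope[, c("school", "student")]))  # distinct students: 8

100 * n_comp / rows_possible      # row-based denominator, as in the question's loop
100 * n_comp / students_possible  # student-based denominator, as in the data.table code
```

Which denominator is the right one depends on whether "possible students" is meant as entries or as people; the numerator counts rows in both cases, so a student with two entries for the same course is counted twice.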

Nota Bene: this can be done in one step fewer if the number of students per course is only computed in the final step:

setDT(df)[, nb_stu_ec:=length(unique(student)), by=school]
df[, sprintf("%.2f", 100*(.N)/sum(nb_stu_ec[!duplicated(school)])), by=course]

#   course    V1
#1:   comp 87.50
#2:    Eng 60.00
#3:    ICT 62.50
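
For readers who would rather stay in dplyr, as the question asks, the same calculation can probably be expressed with a join instead of a loop. The following is only a sketch under stated assumptions, not part of the original answer: the names per_school, n_taking, n_students and res are made up for illustration, and it assumes a dplyr version in which count() accepts the name argument.

```r
library(dplyr)

# Test data from the question, re-created so this block is self-contained
df <- read.csv(text = "school,student,course,result
URN1,stu1,comp,A
URN1,stu2,comp,B
URN1,stu3,comp,C
URN1,stu1,Eng,D
URN1,stu1,ICT,E
URN2,stu4,comp,A
URN1,stu1,ICT,B
URN2,stu5,comp,C
URN3,stu6,comp,D
URN3,stu6,ICT,E
URN4,stu7,Eng,E
URN4,stu8,ICT,E
URN4,stu8,Eng,E
URN5,stu9,comp,E
URN5,stu10,ICT,E", strip.white = TRUE)

# Distinct students per school (hypothetical helper table)
per_school <- df %>%
  group_by(school) %>%
  summarise(n_students = n_distinct(student))

res <- df %>%
  count(school, course, name = "n_taking") %>%   # one row per school/course pair
  left_join(per_school, by = "school") %>%       # attach the school's student count
  group_by(course) %>%
  summarise(num_actual = sum(n_taking),          # entries for the course
            num_possible = sum(n_students)) %>%  # students in schools offering it
  mutate(pct = 100 * num_actual / num_possible)

res
```

With the test data this gives 87.5 for comp, 60 for Eng and 62.5 for ICT, matching the data.table output. Replacing n_distinct(student) with n() would instead reproduce the row-based percentages printed by the loop in the question.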
