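I have a massive dataset of students with a non-standard naming convention for honours students. I need to create/populate a new column that returns Y or N based on a string match for the word "Honours".
Currently my data looks like this, with over 200,000 students: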
library(data.table)
students<-data.table(Student_ID = c(10001:10005),
Degree= c("Bachelor of Laws", "Honours Degree in Commerce", "Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"))
I need to add a third column, so that after creating the new column 'Honours' the data.table way, it is populated as follows:
students<-data.table(Student_ID = c(10001:10005),
Degree= c("Bachelor of Laws", "Honours Degree in Commerce","Bachelor of Laws (with Honours)", "Bachelor of Nursing with Honours", "Bachelor of Nursing"),
Honours = c("N","Y", "Y", "Y","N"))
Any help would be greatly appreciated.
Also, by "the data.table way" I mean:
students[,Honours:="N"]
This is actually pretty simple:
students[, Honours := c("N", "Y")[grepl("Honours", Degree, fixed = TRUE) + 1L]]
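All you need to do is search for "Honours" with some regex-capable function such as grepl (this isn't really a regular expression, so you can improve performance with fixed = TRUE), and then subset the c("N", "Y") vector according to what you find (adding 1L to the TRUE/FALSE logical vector converts it to a vector of 1s and 2s, which is then used to pick values from c("N", "Y")).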
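To make the indexing trick easier to follow, here is a minimal sketch that splits the one-liner into its intermediate steps. It uses base R only, and the names degree, has_honours, idx and flag are made up for this illustration.

```r
# Base R only: walk through the indexing trick one step at a time.
# The variable names below are illustrative, not part of the original answer.
degree <- c("Bachelor of Laws",
            "Honours Degree in Commerce",
            "Bachelor of Nursing")

# Step 1: literal (non-regex) substring match, one TRUE/FALSE per element
has_honours <- grepl("Honours", degree, fixed = TRUE)

# Step 2: adding 1L coerces the logical to integer: FALSE -> 1, TRUE -> 2
idx <- has_honours + 1L

# Step 3: use those positions to pick from c("N", "Y")
flag <- c("N", "Y")[idx]

print(data.frame(degree, has_honours, idx, flag))
```

Position 1 of c("N", "Y") is "N" and position 2 is "Y", so a non-match maps to "N" and a match maps to "Y".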
Alternatively, if that is too hard to read, you can use ifelse instead:
students[, Honours := ifelse(grepl("Honours", Degree, fixed = TRUE), "Y", "N")]
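Of course, if "Honours" can appear in different case variants, you can switch the grepl call to grepl("Honours", Degree, ignore.case = TRUE) (fixed = TRUE is dropped here because it cannot be combined with ignore.case).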
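As a rough illustration of the case-insensitive variant, the sketch below runs it on a small made-up table. The table mixed and its three rows are hypothetical and only serve to show mixed capitalisation.

```r
library(data.table)

# Hypothetical sample with mixed capitalisation (not from the original data)
mixed <- data.table(
  Degree = c("Bachelor of Laws (with HONOURS)",
             "honours degree in commerce",
             "Bachelor of Nursing")
)

# fixed = TRUE is left out: grepl() ignores ignore.case (with a warning)
# when fixed = TRUE is also set.
mixed[, Honours := grepl("honours", Degree, ignore.case = TRUE)]

print(mixed)
```

The trade-off is that the pattern is now treated as a regular expression, so on a very large table this may be somewhat slower than the fixed = TRUE version.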
P.S.
I would suggest sticking with a logical vector, because it is easy to operate on afterwards.
For example:
students[, Honours := grepl("Honours", Degree, fixed = TRUE)]
Now, if you only want to select the students with honours, you can do
students[(Honours)]
#    Student_ID                           Degree Honours
# 1:      10002       Honours Degree in Commerce    TRUE
# 2:      10003  Bachelor of Laws (with Honours)    TRUE
# 3:      10004 Bachelor of Nursing with Honours    TRUE
Or the students without honours
students[!(Honours)]
#    Student_ID              Degree Honours
# 1:      10001    Bachelor of Laws   FALSE
# 2:      10005 Bachelor of Nursing   FALSE
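A few more things a logical column makes easy are sketched below, assuming the same five-row sample from the question. The names n_honours, share_honours, by_flag and Honours_YN are made up for this example.

```r
library(data.table)

# Same sample data as in the question
students <- data.table(
  Student_ID = 10001:10005,
  Degree = c("Bachelor of Laws", "Honours Degree in Commerce",
             "Bachelor of Laws (with Honours)",
             "Bachelor of Nursing with Honours", "Bachelor of Nursing")
)
students[, Honours := grepl("Honours", Degree, fixed = TRUE)]

# TRUE counts as 1 and FALSE as 0, so sum() counts and mean() gives a share
n_honours     <- students[, sum(Honours)]
share_honours <- students[, mean(Honours)]

# Number of students per group
by_flag <- students[, .N, by = Honours]

# If a "Y"/"N" column is still needed (say, for an export), derive it at the end
students[, Honours_YN := ifelse(Honours, "Y", "N")]

print(by_flag)
```

Keeping the logical column as the working representation and deriving the Y/N text only when it is needed avoids repeating string comparisons such as Honours == "Y" in every filter.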