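I want to create a new variable that equals the value of one of two other variables, conditional on the values of still other variables. Here is a toy example with fake data.
Each row of the data frame represents a student. Each student can study up to two subjects (subj1 and subj2) and can pursue a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, and about 50 subjects, and students can have up to five majors/minors.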
  ID subj1 degree1 subj2 degree2
1  1   BUS      BA  <NA>    <NA>
2  2   SCI      BA   ENG      BA
3  3   BUS      MN   ENG      BA
4  4   SCI      MN   BUS      BA
5  5   ENG      BA   BUS      MN
6  6   SCI      MN  <NA>    <NA>
7  7   ENG      MN   SCI      BA
8  8   BUS      BA   ENG      MN
...
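Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject whose degree equals "BA". I tried the following code: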
df$major[df$degree1 == "BA"] = df$subj1
df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2
Unfortunately, I get an error message:
> df$major[df$degree1 == "BA"] = df$subj1
Error in df$major[df$degree1 == "BA"] = df$subj1 :
NAs are not allowed in subscripted assignments
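I assume this means that a vectorized assignment cannot be used if the condition evaluates to NA for at least one row.
I feel like I must be missing something basic, but the code above seemed like the obvious thing to do, and I have not been able to come up with an alternative.
In case it helps in writing an answer, here is sample data created with dput(), in the same format as the fake data listed above: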
structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L,
2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L
), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L,
NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L,
2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L,
NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"),
degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA,
2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA",
"MN"), class = "factor")), .Names = c("ID", "subj1", "degree1",
"subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")
Ben*_*nes 31
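Your original assignment approach fails for at least two reasons.
1) A problem with the subscripted assignment df$major[df$degree1 == "BA"] <- . Using == can produce NA, which is what prompts the error. From ?"[<-": "When replacing (that is using indexing on the lhs of an assignment) NA does not select any element to be replaced. As there is ambiguity as to whether an element of the rhs should be used or not, this is only allowed if the rhs value is of length one (so the two interpretations would have the same outcome)." There are many ways around this, but I prefer using which: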
df$major[which(df$degree1 == "BA")] <-
The difference is that == returns TRUE, FALSE and NA, whereas which returns the indices of the elements that are TRUE:
> df$degree1 == "BA"
[1] FALSE NA TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> which(df$degree1 == "BA")
[1] 3 4 5 8 9 10 11 12 13 14 15 16 17 18 19 20
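2) When you perform a subscripted assignment, the right-hand side needs to fit sensibly into the left-hand side (that is how I think of it). This can mean that the left and right sides have equal length, which is what your example seems to imply. So you also need to subset the right-hand side of the assignment: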
df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
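Both reasons can be reproduced on a few toy vectors. This is only a sketch: subj, deg, major, wrong and right are made-up names for illustration, not the question's data frame.

```r
# Toy data (hypothetical, not the question's data frame)
subj  <- c("BUS", "ENG", "SCI", "BUS")
deg   <- c("MN", "BA", NA, "BA")
is_ba <- deg == "BA"                 # FALSE TRUE NA TRUE

# Reason 1: an NA in a logical subscript is an error when the
# replacement has more than one element
major <- rep(NA_character_, 4)
res <- tryCatch({ major[is_ba] <- subj; "no error" },
                error = function(e) conditionMessage(e))
print(res)

# which() drops the NA and returns only the TRUE positions
keep <- which(is_ba)
print(keep)

# Reason 2: with an unsubsetted right-hand side, values are taken from
# the start of subj, so rows 2 and 4 get subj[1] and subj[2]
# (R also warns about the length mismatch)
wrong <- rep(NA_character_, 4)
suppressWarnings(wrong[keep] <- subj)
print(wrong)

# Subsetting both sides with the same index keeps the rows aligned
right <- rep(NA_character_, 4)
right[keep] <- subj[keep]
print(right)
```

Here wrong ends up holding "BUS" in row 2 even though that student's BA subject is "ENG", which is the silent misalignment that subsetting the right-hand side avoids.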
I hope that clarifies why your original attempt produced an error.
Using ifelse, as @DavidRobinson suggested, is a good way to do this kind of assignment. My take on it:
df$major2 <- ifelse(df$degree1 == "BA", df$subj1, ifelse(df$degree2 == "BA",
df$subj2,NA))
This is equivalent to
df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <-
df$subj2[which(df$degree1 != "BA" & df$degree2 == "BA")]
Depending on how deeply the ifelse statements would have to be nested, another approach may suit your real data better.
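One possible alternative for many subject/degree pairs, sketched here under assumptions, is to loop over the column pairs and fill in the major only where none has been found yet. The helper first_ba() and the toy data frame are hypothetical names introduced for this illustration; they are not part of the original answer.

```r
# Hypothetical helper: take the subject of the first degree equal to `target`
first_ba <- function(df, subj_cols, degree_cols, target = "BA") {
  major <- rep(NA_character_, nrow(df))
  for (i in seq_along(subj_cols)) {
    subj   <- as.character(df[[subj_cols[i]]])
    degree <- as.character(df[[degree_cols[i]]])
    # rows still without a major whose i-th degree matches the target;
    # which() discards rows where the comparison is NA
    hit <- which(is.na(major) & degree == target)
    major[hit] <- subj[hit]
  }
  major
}

# Toy data with factor columns, as in the question
toy <- data.frame(
  subj1   = c("BUS", "SCI", "ENG", "SCI"),
  degree1 = c("BA", "MN", "BA", "MN"),
  subj2   = c(NA, "BUS", "BUS", NA),
  degree2 = c(NA, "BA", "BA", NA),
  stringsAsFactors = TRUE
)
toy$major <- first_ba(toy, c("subj1", "subj2"), c("degree1", "degree2"))
print(toy)
```

Extending this to five pairs would only mean passing longer column-name vectors, instead of adding another level of ifelse nesting.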
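EDIT:
I was going to write a third reason for the original code failing (namely that df$major had not yet been assigned), but it works for me without having to do that. It is a problem I remember having in the past, though. What version of R are you running? (2.15.0 for me.) This step is not necessary if you use the ifelse() approach. Your solution is fine when using [, although I would have chosen to go with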
df$major <- NA
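To get the character values of the subjects, rather than the factor level indices, use as.character() (which for factors is equivalent to, and calls, levels(x)[x]):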
df$major[which(df$degree1 == "BA")] <- as.character(df$subj1)[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <-
as.character(df$subj2)[which(df$degree1 != "BA" & df$degree2 == "BA")]
The same goes for ifelse():
df$major2 <- ifelse(df$degree1 == "BA", as.character(df$subj1),
ifelse(df$degree2 == "BA", as.character(df$subj2), NA))
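A short illustration of why the as.character() call matters, using a made-up factor f and condition cond (both hypothetical): ifelse() does not keep the factor attributes of its yes argument, so a factor comes back as integer level codes.

```r
f    <- factor(c("SCI", "BUS", "ENG"))   # levels sort to BUS, ENG, SCI
cond <- c(TRUE, FALSE, TRUE)

codes  <- ifelse(cond, f, NA)                 # integer level codes
labels <- ifelse(cond, as.character(f), NA)   # character labels

print(codes)
print(labels)

# as.character() on a factor gives the same result as levels(f)[f]
same <- identical(as.character(f), levels(f)[f])
print(same)
```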
In general, the ifelse function is the right choice for situations like this, for example:
df$major = ifelse((!is.na(df$degree1) & df$degree1 == "BA") & (is.na(df$degree2) | df$degree2 != "BA"), df$subj1, df$subj2)
However, its exact usage depends on what you want to do when both df$degree1 and df$degree2 are "BA".
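To make that choice concrete, here is a minimal sketch with a made-up data frame toy in which the first student holds a "BA" in both subjects. The names is_ba1, is_ba2, prefer_first and prefer_second are hypothetical and only illustrate the two possible conventions.

```r
toy <- data.frame(
  subj1   = c("BUS", "SCI"),
  degree1 = c("BA", "BA"),
  subj2   = c("ENG", NA),
  degree2 = c("BA", NA),
  stringsAsFactors = FALSE
)

# NA-safe tests, so a missing degree counts as "not BA"
is_ba1 <- !is.na(toy$degree1) & toy$degree1 == "BA"
is_ba2 <- !is.na(toy$degree2) & toy$degree2 == "BA"

# Convention 1: subj1 wins whenever degree1 is "BA"
prefer_first  <- ifelse(is_ba1, toy$subj1, ifelse(is_ba2, toy$subj2, NA))

# Convention 2: subj2 wins whenever degree2 is "BA"
prefer_second <- ifelse(is_ba2, toy$subj2, ifelse(is_ba1, toy$subj1, NA))

print(prefer_first)
print(prefer_second)
```

The two conventions differ only for the student with two "BA" degrees; the question's rule (the first subject whose degree is "BA") corresponds to the first convention.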