When subsetting rows by a factor with equality (==), NA rows are also included. This does not happen with %in%. Is this normal?

use*_*392 5 r equals subset na

Suppose I have a factor A with 3 levels (A1, A2, A3) plus NA. Each value occurs in 10 cases, so there are 40 cases in total. If I do

subset1 <- df[df$A=="A1",]  
dim(subset1)  # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),] 
dim(subset2)  # 10, as expected
summary(subset2$A) # only A1 has non-zero count

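The result is the same whether the variable used for subsetting is a factor or an integer. Is this how equality (and >, <) is supposed to work? If so, should I stick to %in% for factors and always add !is.na when using equality? Thanks!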

Sim*_*lon 5

Yes. == and %in% return different results because of how NA handling is defined in "%in%"...

# Data...
x <- c("A",NA,"A")

# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE   NA TRUE
x[x=="A"]
#[1] "A" NA  "A"

# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1]  TRUE FALSE  TRUE
x[ x %in% "A" ]
#[1] "A" "A"
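To tie this back to the question's data frame, here is a minimal sketch. The data frame df and its id column are hypothetical, built only to mirror the 40 cases described in the question. It suggests that the 10 extra rows returned by == are not the original NA rows: an NA in a logical row index produces a row filled entirely with NA.

```r
# Hypothetical data mirroring the question: 10 each of A1, A2, A3 and NA
df <- data.frame(A  = factor(rep(c("A1", "A2", "A3", NA), each = 10)),
                 id = 1:40)

idx <- df$A == "A1"           # logical vector holding TRUE, FALSE and NA
subset1 <- df[idx, ]          # each NA in idx yields an all-NA row
nrow(subset1)                 # 20
sum(is.na(subset1$id))        # 10: id is NA too, so these are placeholder rows

subset2 <- df[df$A %in% "A1", ]   # %in% gives FALSE for NA, so no extra rows
nrow(subset2)                     # 10
```

Checking a second column such as id is a quick way to tell these placeholder rows apart from real rows whose factor value happens to be missing.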

This is because (from the documentation)...

%in% is a thin wrapper around match, defined as

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
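The definition is easier to follow when the intermediate values are spelled out. The sketch below reuses the vector x from the earlier example; the variable names pos_default, pos_zero, in_result and na_in_table are illustrative only.

```r
x <- c("A", NA, "A")

pos_default <- match(x, "A")               # 1 NA 1: no match gives NA by default
pos_zero    <- match(x, "A", nomatch = 0)  # 1 0 1: no match gives 0 instead
in_result   <- pos_zero > 0                # TRUE FALSE TRUE, same as x %in% "A"

# match() treats NA as an ordinary value, so NA can be found if it is in the table
na_in_table <- NA %in% c("A", NA)          # TRUE
```

So %in% never returns NA: a missing value on the left is either found in the table or reported as FALSE.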

If we redefine it using match's standard default (nomatch = NA_integer_), you will see that it behaves the same way as ==

"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE   NA TRUE
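On the practical part of the question (whether to guard equality with !is.na, and what happens with > and <), the sketch below shows a few common ways to drop the NA positions while still using a comparison operator. The vectors y and the small data frame df are hypothetical examples, not data from the question.

```r
x <- c("A", NA, "A")
keep_which <- x[which(x == "A")]        # which() returns only the TRUE positions
keep_guard <- x[!is.na(x) & x == "A"]   # explicit NA guard: FALSE & NA is FALSE

# Ordering comparisons propagate NA in the same way as ==
y <- c(1, NA, 3)
gt_raw   <- y[y > 2]                    # NA 3
gt_which <- y[which(y > 2)]             # 3

# subset() treats NA in the condition as FALSE
df  <- data.frame(A = factor(c("A1", NA, "A2", "A1")))
sub <- subset(df, A == "A1")
nrow(sub)                               # 2
```

Any of these works; %in% remains the shortest option for equality tests, while which() or an explicit !is.na guard covers > and <, where %in% does not apply.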