lus*_*ser 2 indexing r dataframe
I have tried to reproduce this problem with a built-in dataset, but it only occurs with my own data.
If we take a random subset of my data:
data <- structure(list(ID = structure(c(27L, 1L, 27L, 7L, 5L, 10L, 23L,
19L, 21L, 26L), .Label = c("AC ", "AJ ", "AT ", "AWY", "BP ",
"BW ", "CA ", "CK ", "CS ", "DJ ", "EN ", "ES ", "HF ", "HG ",
"HL ", "HR ", "IP ", "JA ", "JG ", "JN ", "KB ", "KP ", "MJ ",
"PC ", "RFH", "RPA", "SB ", "SG ", "TM "), class = "factor"),
TNO = c(30L, 60L, 30L, 10000L, 10000L, 10000L, 120L, 60L,
120L, 10000L), TNOGroup = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 2L, 2L), .Label = c("Good", "Poor"), class = "factor"),
x = c(6.15, 7.75, 5.6, 3.05, 3, 4.1, 6, 3.9, 5.85, 3.75),
View = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L
), .Label = c("Binocular", "Monocular"), class = "factor"),
Prior = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L
), .Label = c("N", "Y"), class = "factor")), .Names = c("ID",
"TNO", "TNOGroup", "x", "View", "Prior"), row.names = c(169L,
49L, 24L, 16L, 9L, 4L, 35L, 18L, 164L, 36L), class = "data.frame")
and then try to remove every row whose ID is a two-character string, for example "SB":
data2 <- data[!data$ID %in% c("SB"),] # List syntax in case multiple cases
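However, when I inspect the data frame, the rows with ID "SB" are still there. When I try the same thing with a three-character string (e.g. "RPA"), all rows with that ID are removed as expected.
Any insight into why this happens?
As an alternative to %in%, I suggest trying grepl, as follows: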
mydf[!grepl("CA", mydf$ID), ]
   ID   TNO TNOGroup    x      View Prior
1 AC     60     Good 5.75 Binocular     Y
2 RFH    60     Good 5.60 Monocular     N
3 BP  10000     Poor 3.00 Monocular     N
4 HG     60     Good 4.30 Binocular     Y
6 IP    120     Poor 5.50 Monocular     N
7 JG     60     Good 3.80 Monocular     Y
9 AWY 10000     Poor 3.70 Monocular     Y
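My suspicion (which I can't verify without seeing a subset of your data supplied via dput) is that the "CA" values have whitespace around them. To R, "CA" is not the same as "CA ", even though the two may look identical when printed in a data.frame.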
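The factor levels in the dput output in the question do appear to be padded to three characters ("SB ", "AC "), which is consistent with this suspicion. A minimal sketch of the effect, using a made-up ids vector rather than the asker's full data, and assuming trimws (available from R 3.2.0) for the comparison:

```r
# Hypothetical IDs, padded to three characters like the levels in the dput output
ids <- factor(c("SB ", "AC ", "RPA", "SB "))

"SB" == "SB "          # FALSE: the trailing space is significant
ids %in% c("SB")       # all FALSE, so no row would be dropped
ids %in% c("RPA")      # matches, because "RPA" carries no padding

# Compare against trimmed values instead of the raw ones
keep <- !trimws(as.character(ids)) %in% c("SB")
ids[keep]
```

This leaves the stored values untouched and only trims them for the comparison; cleaning the column itself is usually the better long-term fix.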
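Problems like this commonly arise when the file being read in contains whitespace. By default, R is conservative about deciding whether to strip that whitespace, but it does provide the logical argument strip.white for read.table and family.
So you can avoid the problem by using: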
read.csv("yourfile.csv", strip.white = TRUE)
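Also note that grepl is not necessarily a safer or more strongly recommended alternative to %in%. Using grepl can have unintended consequences. For example, if you had another ID "CAR", the approach I shared would match that too.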
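To make the "CAR" caveat concrete, here is a small sketch with made-up IDs. The anchored pattern is only one possible way to tighten the match, not something taken from the original answer:

```r
ids <- c("CA ", "CAR", "AC ")   # hypothetical IDs for illustration

# An unanchored pattern matches any ID that contains "CA"
loose <- grepl("CA", ids)

# Anchoring the pattern and allowing only surrounding whitespace
# restricts the match to "CA" itself
strict <- grepl("^[[:space:]]*CA[[:space:]]*$", ids)

loose    # TRUE  TRUE FALSE
strict   # TRUE FALSE FALSE
```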
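Even strip.white won't solve all of your problems. If the strings in your CSV are quoted and there is hard-coded whitespace between the quotes, strip.white treats that whitespace as intentional and keeps it.
Here is a basic example.
We'll create a CSV file in which the first data row has whitespace hard-coded inside the quotes and the second data row does not.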
myTest <- tempfile()
cat(file = myTest, 'A, B, C',
'"AA", "BB ", "CC"',
' AA, BB , CC',
sep = "\n")
Now read the file with read.csv, both with and without strip.white = TRUE, and compare the output.
A <- read.csv(myTest)
B <- read.csv(myTest, strip.white = TRUE)
print(A, quote = TRUE)
# A B C
# 1 "AA" " BB " " CC"
# 2 " AA" " BB " " CC"
print(B, quote = TRUE)
# A B C
# 1 "AA" "BB " "CC"
# 2 "AA" "BB" "CC"
unlink(myTest)
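Note that in "B", the whitespace is trimmed appropriately in the row where it was not hard-coded between quotes, but it is still present in the first row. To fix that, you may need to use a regular expression to remove whitespace from the beginning and end of the strings.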
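One way that regular-expression cleanup might look is sketched below. The data frame df and the helper trim are hypothetical names introduced for illustration; only gsub and base subsetting are used:

```r
# Hypothetical data with padded IDs, stored as a factor as in the question
df <- data.frame(ID = c("SB ", " AC", "RPA"), x = 1:3,
                 stringsAsFactors = TRUE)

# Remove whitespace at the start and end of each string
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

# Rebuild the factor from the cleaned values so the levels are trimmed too
df$ID <- factor(trim(as.character(df$ID)))

# With clean levels, %in% behaves as the question expected
result <- df[!df$ID %in% c("SB"), ]
result
```

Converting to character before trimming and then rebuilding the factor avoids leaving stale padded levels behind.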