Removing specific rows from a data frame

ATM*_*hew 4 r dataframe

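I am working with some US government data that contains a long list of cities and ZIP codes. After some processing, the data is in the following format.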

dat1 = data.frame(keyword=c("Bremen", "Brent", "Centreville, AL", "Chelsea, AL", "Bailytown, Alabama", "Calera, Alabama",
              "54023", "54024"), tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), rep("AlabamaCityState",2), rep("AlabamaZipCode",2)))
dat1

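However, some of the keywords are not classified correctly. In the example below, two "ZIP codes" are tagged "AlabamaCityST" and "AlabamaCityState". For some reason, the government's original data set has several ZIP codes that are not grouped with the other ZIP codes.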

dat2 = data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL", "Bailytown, Alabama", "52348",
              "54023", "54024"), tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), rep("AlabamaCityState",2), rep("AlabamaZipCode",2)))
dat2

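I would like to know how to go through the whole list of keywords and remove every row that has a numeric value (these are actually stored as character values) but does not carry the "AlabamaZipCode" tag. The previous data should then look like this: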

dat3 = data.frame(keyword=c("Bremen", "Brent", "Chelsea, AL", "Bailytown, Alabama", "54023", "54024"), 
          tag=c(rep("AlabamCity",2), rep("AlabamaCityST",1), rep("AlabamaCityState",1), rep("AlabamaZipCode",2)))
dat3

The challenge seems to be that I want to keep some numeric values while removing others. Can anyone help?

42-*_*42- 11

I think two grepl expressions should solve this:

> dat2[ !( grepl("City", dat2$tag) &  grepl("^\\d", dat2$keyword) ) , ]
             keyword              tag
1             Bremen       AlabamCity
2              Brent       AlabamCity
4        Chelsea, AL    AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7              54023   AlabamaZipCode
8              54024   AlabamaZipCode

You are eliminating the rows that have a keyword starting with a digit and a tag containing "City".
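
A slightly stricter variant, offered as a sketch rather than as the answerer's own code: the pattern ^\\d only checks that the keyword starts with a digit, and "City" relies on every non-ZIP tag containing that substring. Assuming a ZIP code is a keyword made up solely of digits, and that "AlabamaZipCode" is the only tag allowed to hold one, the filter could be written as:

```r
# Rebuild the example data from the question.
dat2 <- data.frame(
  keyword = c("Bremen", "Brent", "50143", "Chelsea, AL",
              "Bailytown, Alabama", "52348", "54023", "54024"),
  tag = c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
          rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2)),
  stringsAsFactors = FALSE
)

# TRUE where the whole keyword consists of digits
# (assumed definition of a ZIP code, not stated in the question).
all_digits <- grepl("^[0-9]+$", dat2$keyword)

# Drop digit-only keywords unless they carry the ZIP code tag.
cleaned <- dat2[!(all_digits & dat2$tag != "AlabamaZipCode"), ]
cleaned
```

Matching the tag exactly means a newly introduced tag without "City" in its name would still be cleaned.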


Rei*_*son 5

It helps to store the data as characters rather than factors:

dat2 <- data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL", 
                             "Bailytown, Alabama", "52348", "54023", "54024"),   
                   tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), 
                         rep("AlabamaCityState",2), rep("AlabamaZipCode",2)),
                   stringsAsFactors = FALSE) ## note this bit
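
Why this matters, illustrated with a made-up vector f that is not part of the original data: calling as.numeric() directly on a factor returns the internal integer codes of its levels, not the numbers the labels spell out. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the argument is only needed on older versions. For a column that is already a factor, converting with as.character() first avoids the problem.

```r
# Hypothetical factor holding a mix of names and ZIP-like strings.
f <- factor(c("Bremen", "50143", "54023"))

# Coercing the factor directly yields level codes (small integers),
# not the ZIP values.
codes <- as.numeric(f)

# Going through character first recovers the values;
# strings that are not numbers become NA.
values <- suppressWarnings(as.numeric(as.character(f)))

codes
values
```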

Now we can convert keyword to numeric; wherever a value is not a number in character form, we get an NA:

want <- with(dat2, as.numeric(keyword))

This gives us:

> (want <- with(dat2, as.numeric(keyword)))
[1]    NA    NA 50143    NA    NA 52348 54023 54024
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion

You can ignore that warning or suppress it, but don't use suppressWarnings() casually, as it can mask problems:

suppressWarnings(want <- with(dat2, as.numeric(keyword)))
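
If you would rather not silence warnings at all, one alternative (a sketch, not part of the original answer) is to test each string with a regular expression and convert only the ones that look numeric. The pattern below assumes the values of interest are plain runs of digits; the vector keyword is copied by hand from dat2 to keep the example self-contained.

```r
keyword <- c("Bremen", "Brent", "50143", "Chelsea, AL",
             "Bailytown, Alabama", "52348", "54023", "54024")

# Identify digit-only strings up front.
is_num <- grepl("^[0-9]+$", keyword)

# Start from all NA and fill in only the entries that are safe to convert,
# so as.numeric() never sees a non-numeric string and emits no warning.
want <- rep(NA_real_, length(keyword))
want[is_num] <- as.numeric(keyword[is_num])
want
```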

The final step is to select the elements where want is not NA and tag is not equal to "AlabamaZipCode", which we do using &:

(!is.na(want) & (dat2$tag != "AlabamaZipCode"))

That selects the rows we don't want, so we need to negate the above, turning TRUE into FALSE and vice versa:

!(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
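
By De Morgan's laws the negated condition can also be written without the outer !: a row is kept when its keyword is not numeric or its tag is the ZIP code tag. The sketch below checks that both forms agree on the example data; want and tag are typed in by hand here so the snippet runs on its own.

```r
want <- c(NA, NA, 50143, NA, NA, 52348, 54023, 54024)
tag <- c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
         rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2))

# Form used in the answer: negate the "rows to drop" condition.
keep_negated <- !(!is.na(want) & (tag != "AlabamaZipCode"))

# Equivalent form stating directly which rows to keep.
keep_direct <- is.na(want) | (tag == "AlabamaZipCode")

identical(keep_negated, keep_direct)
```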

Putting it all together, we have:

dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]

This gives:

> dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
             keyword              tag
1             Bremen       AlabamCity
2              Brent       AlabamCity
4        Chelsea, AL    AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7              54023   AlabamaZipCode
8              54024   AlabamaZipCode

The complete solution is:

want <- with(dat2, as.numeric(keyword))
dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
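
For reuse with other tags, the two lines can be wrapped in a small helper. The function name drop_stray_numeric and its zip_tag argument are hypothetical, introduced only for illustration; the sketch assumes the data frame has keyword and tag columns as in dat2.

```r
drop_stray_numeric <- function(df, zip_tag = "AlabamaZipCode") {
  # as.character() guards against factor columns. Warnings about
  # non-numeric strings are expected here, so they are suppressed.
  num <- suppressWarnings(as.numeric(as.character(df$keyword)))
  # Keep rows whose keyword is not numeric or whose tag is the ZIP tag.
  df[is.na(num) | as.character(df$tag) == zip_tag, , drop = FALSE]
}

dat2 <- data.frame(
  keyword = c("Bremen", "Brent", "50143", "Chelsea, AL",
              "Bailytown, Alabama", "52348", "54023", "54024"),
  tag = c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
          rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2)),
  stringsAsFactors = FALSE
)

res <- drop_stray_numeric(dat2)
res
```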