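I'm working with some US government data that contains long lists of cities and ZIP codes. After some processing, the data is in the following format.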
dat1 = data.frame(keyword=c("Bremen", "Brent", "Centreville, AL", "Chelsea, AL",
                            "Bailytown, Alabama", "Calera, Alabama", "54023", "54024"),
                  tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2),
                        rep("AlabamaCityState",2), rep("AlabamaZipCode",2)))
dat1
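However, some keywords are mislabeled. In the example below, two ZIP codes are tagged "AlabamaCityST" and "AlabamaCityState". For some reason, the government's original dataset has a few ZIP codes that are not grouped with the other ZIP codes.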
dat2 = data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL",
                            "Bailytown, Alabama", "52348", "54023", "54024"),
                  tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2),
                        rep("AlabamaCityState",2), rep("AlabamaZipCode",2)))
dat2
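I would like to know how to go through the whole keyword list and remove every row that has a numeric value (these are actually stored as character values) and does not have the "AlabamaZipCode" tag. So the data above should end up looking like this.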
dat3 = data.frame(keyword=c("Bremen", "Brent", "Chelsea, AL",
                            "Bailytown, Alabama", "54023", "54024"),
                  tag=c(rep("AlabamCity",2), rep("AlabamaCityST",1),
                        rep("AlabamaCityState",1), rep("AlabamaZipCode",2)))
dat3
The challenge seems to be that I want to keep some numeric values and remove others. Can anyone help?
I think two grepl expressions should solve this:
> dat2[ !( grepl("City", dat2$tag) & grepl("^\\d", dat2$keyword) ) , ]
             keyword              tag
1             Bremen       AlabamCity
2              Brent       AlabamCity
4        Chelsea, AL    AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7              54023   AlabamaZipCode
8              54024   AlabamaZipCode
You are eliminating the rows whose keyword starts with a digit and whose tag contains "City".
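One caveat: the pattern ^\\d only tests whether the keyword starts with a digit, so a keyword such as "29 Palms" under a city tag would also be dropped. A slightly stricter sketch, assuming the misfiled ZIP codes consist of digits only (my assumption, not something the question states), anchors the pattern at both ends. The names all_digits, city_tag and cleaned are illustrative.

```r
# Same data as dat2 in the question
dat2 <- data.frame(
  keyword = c("Bremen", "Brent", "50143", "Chelsea, AL",
              "Bailytown, Alabama", "52348", "54023", "54024"),
  tag = c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
          rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2))
)

# TRUE where the keyword is made up of digits only
# (grepl coerces factors to character, so this works either way)
all_digits <- grepl("^[0-9]+$", dat2$keyword)
# TRUE where the tag contains the literal substring "City"
city_tag <- grepl("City", dat2$tag, fixed = TRUE)

# Keep every row that is not an all-digit keyword under a city tag
cleaned <- dat2[!(all_digits & city_tag), ]
cleaned
```

On the example data this keeps the same six rows as the one-liner above.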
It helps to have the data stored as characters rather than factors:
dat2 <- data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL",
"Bailytown, Alabama", "52348", "54023", "54024"),
tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2),
rep("AlabamaCityState",2), rep("AlabamaZipCode",2)),
stringsAsFactors = FALSE) ## note this bit
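To illustrate why this matters for the conversion that follows: calling as.numeric() on a factor returns the internal integer codes of its levels, not the numbers spelled out in the labels. The vector f below is made up for this sketch. Note also that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the argument is only needed on older versions.

```r
# A factor whose labels happen to look like numbers
f <- factor(c("50143", "54023", "50143"))

codes  <- as.numeric(f)                # level codes, not ZIP codes
values <- as.numeric(as.character(f))  # the numbers the labels spell out

codes
values
```

Here codes holds 1, 2, 1, while values holds the actual five-digit numbers.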
Now we can convert keyword to numeric; wherever it is not a number in character format, we get an NA:
want <- with(dat2, as.numeric(keyword))
This gives us:
> (want <- with(dat2, as.numeric(keyword)))
[1] NA NA 50143 NA NA 52348 54023 54024
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
You can ignore that warning or suppress it, but don't suppress warnings casually, since doing so can mask real problems:
suppressWarnings(want <- with(dat2, as.numeric(keyword)))
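If you would rather not suppress warnings at all, one alternative sketch (not part of the original answer) is to test for all-digit strings first and convert only those, so no coercion warning is raised. The name is_num is illustrative, and the pattern only recognises non-negative whole numbers, whereas as.numeric() also accepts strings such as "3.14" or "1e5".

```r
keyword <- c("Bremen", "Brent", "50143", "Chelsea, AL",
             "Bailytown, Alabama", "52348", "54023", "54024")

is_num <- grepl("^[0-9]+$", keyword)         # TRUE for all-digit strings
want <- rep(NA_real_, length(keyword))       # start with NA everywhere
want[is_num] <- as.numeric(keyword[is_num])  # convert only the digit strings
want
```

The resulting want has NA and numeric values in the same positions as the as.numeric() call on the full column.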
The final step is to select the elements of want that are not NA and whose tag is not equal to "AlabamaZipCode", which we do using &:
(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
That selects the rows we don't want, so we need to negate the above, turning TRUE into FALSE and vice versa:
!(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
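By De Morgan's laws, the negated condition can also be written directly as the rows to keep: the keyword is not numeric, or the tag is "AlabamaZipCode". A quick check that both forms agree, with want and tag written out by hand so the snippet stands alone (keep_negated and keep_direct are names chosen for this sketch):

```r
want <- c(NA, NA, 50143, NA, NA, 52348, 54023, 54024)
tag <- c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
         rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2))

# Negated form: NOT (numeric AND not a ZIP-code tag)
keep_negated <- !(!is.na(want) & (tag != "AlabamaZipCode"))
# Direct form: not numeric OR a ZIP-code tag
keep_direct <- is.na(want) | (tag == "AlabamaZipCode")

identical(keep_negated, keep_direct)
```

Which form to use is a matter of taste; the direct form avoids the double negation.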
Putting this together, we have:
dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
This gives:
> dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
             keyword              tag
1             Bremen       AlabamCity
2              Brent       AlabamCity
4        Chelsea, AL    AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7              54023   AlabamaZipCode
8              54024   AlabamaZipCode
The full solution is:
want <- with(dat2, as.numeric(keyword))
dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
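If the same cleanup has to be applied to several data sets, the two lines can be wrapped in a small function. This is only a sketch: the function name drop_misfiled_numbers and its zip_tag argument are hypothetical, and it assumes the columns are called keyword and tag as in the question.

```r
# Hypothetical helper; name and arguments are illustrative, not from the answer
drop_misfiled_numbers <- function(df, zip_tag = "AlabamaZipCode") {
  # as.character() guards against factor columns. Coercion warnings for
  # non-numeric keywords are expected here, so they are suppressed locally.
  num <- suppressWarnings(as.numeric(as.character(df$keyword)))
  # Keep rows that are not numeric, or that carry the ZIP-code tag
  df[is.na(num) | as.character(df$tag) == zip_tag, , drop = FALSE]
}

dat2 <- data.frame(
  keyword = c("Bremen", "Brent", "50143", "Chelsea, AL",
              "Bailytown, Alabama", "52348", "54023", "54024"),
  tag = c(rep("AlabamCity", 2), rep("AlabamaCityST", 2),
          rep("AlabamaCityState", 2), rep("AlabamaZipCode", 2)),
  stringsAsFactors = FALSE
)

result <- drop_misfiled_numbers(dat2)
result
```

Because the numeric conversion is only used for the test and the keyword column itself is never modified, ZIP codes with leading zeros keep their original text.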