Soi*_*Guy 1 regex r pattern-matching
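I am trying to add a new column based on another column using pattern matching. I have read this post, but I am not getting the desired output.
I want to create a new column (SubOrder) based on the GreatGroup column. I tried the following: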
SubOrder <- rep(NA_character_, length(myData))
SubOrder[grepl("udults", myData, ignore.case = TRUE)] <- "Udults"
SubOrder[grepl("aquults", myData, ignore.case = TRUE)] <- "Aquults"
SubOrder[grepl("aqualfs", myData, ignore.case = TRUE)] <- "aqualfs"
SubOrder[grepl("humods", myData, ignore.case = TRUE)] <- "humods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <- "udalfs"
SubOrder[grepl("orthods", myData, ignore.case = TRUE)] <- "orthods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <- "udalfs"
SubOrder[grepl("psamments", myData, ignore.case = TRUE)] <- "psamments"
SubOrder[grepl("udepts", myData, ignore.case = TRUE)] <- "udepts"
SubOrder[grepl("fluvents", myData, ignore.case = TRUE)] <- "fluvents"
SubOrder[grepl("aquods", myData, ignore.case = TRUE)] <- "aquods"
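For example, I want to look for "udults" inside any word, such as Hapludults or Paleudults, and return only "udults".
Edit: in case anyone wants to follow up on alistaire's comment, this is the search pattern I would use.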
subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents")
Sample data below.
myData <- dput(head(test))
structure(list(1:6, SID = c(200502L, 200502L, 200502L, 200502L,
200502L, 200502L), Groupdepth = c(11L, 12L, 13L, 14L, 21L, 22L
), AWC0to10 = c(0.12, 0.12, 0.12, 0.12, 0.12, 0.12), AWC10to20 = c(0.12,
0.12, 0.12, 0.12, 0.12, 0.12), AWC20to50 = c(0.12, 0.12, 0.12,
0.12, 0.12, 0.12), AWC50to100 = c(0.15, 0.15, 0.15, 0.15, 0.15,
0.15), Db3rdbar0to10 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43),
Db3rdbar10to20 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar20to50 = c(1.43,
1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar50to100 = c(1.43,
1.43, 1.43, 1.43, 1.43, 1.43), HydrcRatngPP = c(0L, 0L, 0L,
0L, 0L, 0L), OrgMatter0to10 = c(1.25, 1.25, 1.25, 1.25, 1.25,
1.25), OrgMatter10to20 = c(1.25, 1.25, 1.25, 1.25, 1.25,
1.25), OrgMatter20to50 = c(1.02, 1.02, 1.02, 1.02, 1.02,
1.02), OrgMatter50to100 = c(0.12, 0.12, 0.12, 0.12, 0.12,
0.12), Clay0to10 = c(8, 8, 8, 8, 8, 8), Clay10to20 = c(8,
8, 8, 8, 8, 8), Clay20to50 = c(9.4, 9.4, 9.4, 9.4, 9.4, 9.4
), Clay50to100 = c(40, 40, 40, 40, 40, 40), Sand0to10 = c(85,
85, 85, 85, 85, 85), Sand10to20 = c(85, 85, 85, 85, 85, 85
), Sand20to50 = c(83, 83, 83, 83, 83, 83), Sand50to100 = c(45.8,
45.8, 45.8, 45.8, 45.8, 45.8), pHwater0to20 = c(6.3, 6.3,
6.3, 6.3, 6.3, 6.3), Ksat0to10 = c(23, 23, 23, 23, 23, 23
), Ksat10to20 = c(23, 23, 23, 23, 23, 23), Ksat20to50 = c(19.7333,
19.7333, 19.7333, 19.7333, 19.7333, 19.7333), Ksat50to100 = c(9,
9, 9, 9, 9, 9), TaxClName = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults",
"Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults",
"Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults",
"Fine, mixed, semiactive, mesic Oxyaquic Hapludults"), GreatGroup = c("Hapludults",
"Hapludults", "Hapludults", "Hapludults", "Hapludults", "Hapludults"
)), .Names = c("", "SID", "Groupdepth", "AWC0to10", "AWC10to20",
"AWC20to50", "AWC50to100", "Db3rdbar0to10", "Db3rdbar10to20",
"Db3rdbar20to50", "Db3rdbar50to100", "HydrcRatngPP", "OrgMatter0to10",
"OrgMatter10to20", "OrgMatter20to50", "OrgMatter50to100", "Clay0to10",
"Clay10to20", "Clay20to50", "Clay50to100", "Sand0to10", "Sand10to20",
"Sand20to50", "Sand50to100", "pHwater0to20", "Ksat0to10", "Ksat10to20",
"Ksat20to50", "Ksat50to100", "TaxClName", "GreatGroup"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -6L))
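A few options, some of which I posted in the comments above.
Note: all options assume that the replacement for a string matching a pattern is just the pattern itself. If you want something else, they can all easily be edited to include separate replacement values.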
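As a rough sketch of what "separate replacement values" could look like, the patterns can be stored as the names of a character vector and the desired labels as its values. The names `labels`, `great_group`, and `sub_order` and the sample strings are made up for illustration; they are not part of the original data.

```r
# Hypothetical lookup: names are the patterns to search for,
# values are the labels to write into the new column.
labels <- c(udults = "Udults", aquults = "Aquults", udalfs = "Udalfs")

# Made-up vector standing in for myData$GreatGroup
great_group <- c("Hapludults", "Paleaquults", "Hapludalfs", "Quartzipsamments")

# Start with NA everywhere, then fill in the label for each pattern that matches
sub_order <- rep(NA_character_, length(great_group))
for (p in names(labels)) {
  sub_order[grepl(p, great_group, ignore.case = TRUE)] <- labels[[p]]
}

sub_order
```

Entries that match none of the patterns (here the last one) stay NA.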
Option 1: for + grepl
Uses the same code as the original, but in a loop to avoid repeating code:
# make a vector of patterns
pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods')
# one NA per row; length(myData) would count columns, not rows
SubOrder <- rep(NA_character_, nrow(myData))
for(x in seq_along(pat)){
  SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <- pat[x]
}
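To see the loop work on more than one great group, here is a small self-contained run. The data frame `toy` and its values are invented for the example. One detail worth checking when adapting this: `length()` of a data frame returns its number of columns, so the result needs one element per row instead.

```r
pat <- c("udults", "aquults", "aqualfs", "humods", "udalfs", "orthods",
         "psamments", "udepts", "fluvents", "aquods")

# Made-up data frame; only the GreatGroup column matters here
toy <- data.frame(
  GreatGroup = c("Hapludults", "Paleudults", "Endoaquods",
                 "Udifluvents", "Histosols"),
  stringsAsFactors = FALSE
)

length(toy)  # 1: the number of columns
nrow(toy)    # 5: the number of rows

# Assigning a single NA recycles it to one value per row
toy$SubOrder <- NA_character_
for (p in pat) {
  toy$SubOrder[grepl(p, toy$GreatGroup, ignore.case = TRUE)] <- p
}

toy
```

"Histosols" matches nothing in `pat`, so its SubOrder stays NA.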
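Option 2: for + gsub
Copy myData$GreatGroup and then replace within the copy using gsub. The extra regex pasted around each pattern ('.*' on both sides) matches the remaining characters in the same string, so the whole string is replaced by the pattern.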
myData$SubOrder <- myData$GreatGroup
for(x in pat){
myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)
}
Note that values that do not match any of the strings in pat will keep their value from GreatGroup instead of becoming NA. If you want them to be NA, fix them with
myData$SubOrder[!(myData$SubOrder %in% pat)] <- NA
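The following sketch walks through the gsub approach and the NA cleanup on a few invented values, so the intermediate state is visible. The vectors `great_group` and `sub_order` are illustrative names, and `pat` is shortened here.

```r
pat <- c("udults", "aquults", "udalfs", "aquods")

# Made-up values standing in for myData$GreatGroup
great_group <- c("Hapludults", "Paleaquults", "Histosols")

sub_order <- great_group
for (p in pat) {
  # '.*' on both sides makes the regex cover the whole string,
  # so gsub() swaps the entire string for the bare pattern
  sub_order <- gsub(paste0(".*", p, ".*"), p, sub_order, ignore.case = TRUE)
}

# At this point the non-matching "Histosols" is still present unchanged
after_loop <- sub_order

# Turn everything that is not one of the patterns into NA
sub_order[!(sub_order %in% pat)] <- NA

sub_order
```

Because the replacement is always the lowercase pattern from `pat`, the `%in% pat` check works even when the original strings were capitalized.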
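Option 3: stringr::str_replace_all
This is my favorite because it doesn't loop, though it requires the stringr package (which is pretty awesome anyway).
Make a named list from pat, where the names are the regexes to match and the items are the strings to replace them with: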
l <- as.list(pat)
names(l) <- paste0('.*', pat, '.*')
so it looks like
> l
$`.*udults.*`
[1] "udults"
$`.*aquults.*`
[1] "aquults"
$`.*aqualfs.*`
[1] "aqualfs"
......
Then use str_replace_all to do it all in one go:
library(stringr)  # provides str_replace_all
myData$SubOrder <- str_replace_all(myData$GreatGroup, l)
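Boom.
Note 1: str_replace_all does not have an ignore.case option, but you can wrap myData$GreatGroup in tolower (easy) or rework the regex (hard).
Note 2: Like Option 2, this leaves non-matching entries with their value from GreatGroup, so use the line at the end of that option to turn them into NAs if you like.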
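Putting the two notes together, a minimal sketch might look like the following. It assumes the stringr package is installed, and it uses a named character vector (the form described in the stringr documentation for multiple replacements) in place of the named list above. The sample values and the names `repl`, `great_group`, and `sub_order` are made up.

```r
library(stringr)

pat <- c("udults", "aquults", "udalfs")

# Named character vector: names are the regexes, values are the replacements
repl <- setNames(pat, paste0(".*", pat, ".*"))

# Made-up values with mixed case, standing in for myData$GreatGroup
great_group <- c("Hapludults", "PALEAQUULTS", "Histosols")

# Note 1: lowercase the input so the lowercase patterns match regardless of case
sub_order <- str_replace_all(tolower(great_group), repl)

# Note 2: anything that was not replaced by a pattern becomes NA
sub_order[!(sub_order %in% pat)] <- NA

sub_order
```

Lowercasing the input changes the non-matching values too, which does not matter here because they are set to NA in the last step.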