den*_*nin 1 r subset string-matching
Here is the dataset:
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)
data
The result looks like this:
                 company          brand  vol
1         Coca-Cola Inc. Coca-Cola Zero 2456
2           DF, CocaCola            N/A 1653
3              COCA-COLA      Coca-Cola   19
4           PepsiCo Inc.          Pepsi 2766
5 Beverages Distribution     soft drink  167
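Let's say these are import volumes by brand.
The task is to SUBSET the data frame so that it shows only the observations related to Coca-Cola and no other brand.
We need to partially match the COMPANY and BRAND variables against lists of standard keys: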
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")
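I am struggling to implement this idea:
SUBSET data IF brand PARTIALLY matches any key from the brand_key vector OR company PARTIALLY matches any key from company_key
So only the following rows should remain:
(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")
OR
(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")
Note: the matching needs to be case-insensitive.
Desired output: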
                 company          brand  vol
1         Coca-Cola Inc. Coca-Cola Zero 2456
2           DF, CocaCola            N/A 1653
3              COCA-COLA      Coca-Cola   19
4 Beverages Distribution     soft drink  167
Any ideas? Thanks in advance :)
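Use a regular expression and its | (or) operator to combine the keys into a single pattern. The ignore.case argument makes the match case-insensitive.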
# TRUE for rows where company matches any company key OR brand matches any brand key
index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
  grepl(paste0(brand_key, collapse = "|"), data$brand, ignore.case = TRUE)
data[index, ]
#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167
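To make the matching step easier to reuse, the approach can be wrapped in a small helper. The sketch below is only an illustration: the names matches_any and matches_any_fixed are hypothetical and are not part of the original answer. The first helper treats each key as a regular expression, which is fine for the keys in the question because they contain no metacharacters such as "." or "(". The second is a possible alternative that matches the keys literally, lower-casing both sides because fixed = TRUE cannot be combined with ignore.case.

```r
# Sample data and keys copied from the question.
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

# Hypothetical helper: TRUE where x contains at least one key,
# ignoring case. Keys are interpreted as regular expressions.
matches_any <- function(x, keys) {
  pattern <- paste0(keys, collapse = "|")  # e.g. "coca-|cocacola|coca cola"
  grepl(pattern, x, ignore.case = TRUE)
}

# Hypothetical alternative: literal (non-regex) matching, one key at a time.
# Both sides are lower-cased to get case-insensitive matching.
matches_any_fixed <- function(x, keys) {
  hits <- lapply(keys, function(k) grepl(tolower(k), tolower(x), fixed = TRUE))
  Reduce(`|`, hits)
}

# Keep a row if the company OR the brand matches one of its keys.
index <- matches_any(data$company, company_key) |
  matches_any(data$brand, brand_key)
coke <- data[index, ]
rownames(coke)
# [1] "1" "2" "3" "5"
```

The literal variant may be the safer choice if keys are ever taken from user input or contain punctuation like "Inc.", since a "." in a regular expression matches any character.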