Using case_when with multiple patterns for string matching

W14*_*SMH 3 r case-when tidyverse

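I am trying to use str_detect and case_when to recode strings based on multiple patterns, and to paste every occurrence of a recoded value into a new column. The Correct column is the output I am trying to achieve.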

This is similar to this question. If it can't be done with case_when (which, as far as I can tell, is limited to one pattern per row), is there a better way to do it that still uses the tidyverse?

Fruit=c("Apples","apples, maybe bananas","Oranges","grapes w apples","pears")
Num=c(1,2,3,4,5)
data=data.frame(Num,Fruit)

df= data %>% mutate(Incorrect=
paste(case_when(
  str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
  str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
  str_detect(Fruit, regex("grapes | oranges", ignore_case=TRUE)) ~ "ok",
  str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
  TRUE ~ "other"
),sep=","))

  Num                 Fruit Incorrect
  1                Apples      good
  2 apples, maybe bananas      good
  3               Oranges      other
  4       grapes w apples      good
  5                pears       other

 Num                 Fruit    Correct
   1                Apples       good
   2 apples, maybe bananas good,gross
   3               Oranges         ok
   4       grapes w apples    ok,good
   5                pears       other

Ron*_*hah 6

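case_when stops at the first condition that is satisfied for a row and does not check any of the remaining conditions. So in cases like this it is usually better to put each entry on its own row, which makes assigning the values easier, and then summarise them back together. Here, however, the Fruit column has no clear separator: some fruits are separated by a comma (,), some by whitespace, and there are extra words in between. To handle all of these cases, we assign NA to the words that do not match and drop them while summarising.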
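The first-match behaviour is easy to see on a two-element vector. This is only a minimal sketch; x and first_match are illustrative names that do not appear in the question.

```r
library(dplyr)
library(stringr)

# Two strings that each match more than one pattern
x <- c("apples, maybe bananas", "grapes w apples")

# For each element, case_when() uses the value of the first condition
# that is TRUE and ignores any later conditions that also match
first_match <- case_when(
  str_detect(x, "apples")  ~ "good",
  str_detect(x, "bananas") ~ "gross",
  str_detect(x, "grapes")  ~ "ok",
  TRUE                     ~ "other"
)

first_match
# "good" "good"
```

Both elements come back as "good" only, although each also matches a later pattern. The pipeline below works around this by putting every word on its own row first.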

library(dplyr)
library(stringr)

data %>%
  tidyr::separate_rows(Fruit, sep = ",|\\s+") %>%
   mutate(Correct = case_when(
      str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
      str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
      str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
      str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
      TRUE ~ NA_character_)) %>% 
   group_by(Num) %>%
   summarise(Correct = toString(na.omit(Correct))) %>%
   left_join(data, by = "Num")

#   Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 sour        Lemons               

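For the updated data (the output above still shows Lemons in row 5, so it was apparently produced with an earlier version of the data), we can remove the extra words first and then run the same pipeline, this time with "other" as the fallback: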

data %>%
  # Word boundaries keep the pattern from deleting a "w" inside another word
  mutate(Fruit = gsub("\\b(maybe|w)\\b", "", Fruit, perl = TRUE)) %>%
  tidyr::separate_rows(Fruit, sep = ",\\s+|\\s+") %>%
  mutate(Correct = case_when(
     str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
     str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
     str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
     str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
     TRUE ~ "other")) %>% 
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Correct))) %>%
  left_join(data, by = "Num")

#    Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 other       pears                
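If splitting the column into rows is not wanted, another option is to test every pattern against the whole string and paste together the labels that match. The following is only a sketch under a few assumptions: patterns and label_all are hypothetical names introduced here, the data frame is rebuilt with character columns, and the labels come out in the order of the lookup vector rather than in the order the fruits appear in the string (so row 4 gives "good, ok" instead of "ok, good").

```r
library(dplyr)
library(stringr)

data <- data.frame(
  Num = 1:5,
  Fruit = c("Apples", "apples, maybe bananas", "Oranges",
            "grapes w apples", "pears"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are the labels, values are the regex patterns
patterns <- c(good = "apples", gross = "bananas",
              ok = "grapes|oranges", sour = "lemon")

# Hypothetical helper: returns every label whose pattern matches the
# single string x, or "other" when nothing matches
label_all <- function(x) {
  hits <- vapply(
    patterns,
    function(p) str_detect(x, regex(p, ignore_case = TRUE)),
    logical(1)
  )
  if (any(hits)) toString(names(patterns)[hits]) else "other"
}

result <- data %>%
  mutate(Correct = vapply(Fruit, label_all, character(1), USE.NAMES = FALSE))

result$Correct
# "good" "good, gross" "ok" "good, ok" "other"
```

Because each pattern is checked against the untouched string, there is no need to guess the separator or to strip filler words such as "maybe" and "w".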