How to extract conversational utterances from a single string

Chr*_*ann 1 regex r regex-lookarounds

I have a conversation between several speakers recorded as a single string:

convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"

I also have a vector of the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

Using this vector as a component of my regex pattern, I get fairly close with this extraction:

library(stringr)
str_extract_all(convers, 
                paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya"                                        "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff"          "yeah i know thats erm al"                    "hey guys how s it goin"                     
[7] "Great"                                       "where ve you been last week"

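However, the first part of the third speaker's name (al) ends up inside one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?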

Ron*_*hah 5

What if you take a different approach?

Remove all speakers from the text and split the string on '\\s*:\\s*':

strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]

# [1] ""                                            "hiya"                                       
# [3] "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff"          "yeah i know thats erm"                      
# [7] "hey guys how s it goin"                      "Great"                                      
# [9] "where ve you been last week"                 "ah you know camping with my girl friend"   

You can clean up the output a little to drop the first, empty value.
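As a minimal sketch of that clean-up, the empty string can be filtered out with nzchar(). The variable names utterances and who are only illustrative, and the second step (recovering who said each utterance) is an optional addition that assumes every speaker name in the text is immediately followed by a colon.

```r
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
speakers <- c("Peter", "Mary", "al hamshi")

# Build one alternation from the speaker names: "Peter|Mary|al hamshi"
name_pattern <- paste(speakers, collapse = "|")

# Remove the names, then split on the colon that followed each name
utterances <- strsplit(gsub(name_pattern, "", convers), "\\s*:\\s*")[[1]]

# Drop empty strings (here, the one produced before the first colon)
utterances <- utterances[nzchar(utterances)]

# Optional (assumption: each name is directly followed by ":"):
# collect the speaker of each utterance, in order of appearance
who <- regmatches(
  convers,
  gregexpr(paste0("(", name_pattern, ")(?=:)"), convers, perl = TRUE)
)[[1]]

print(data.frame(speaker = who, utterance = utterances))
```

Note that gsub() removes a speaker's name wherever it occurs, so a name mentioned inside an utterance would be deleted as well; with this sample conversation that does not happen.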