How to extract conversational utterances from a single string

Question

I have a conversation between several speakers recorded as a single string:

convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"

I also have a vector of the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

Using this vector as a component of my regex pattern, I get reasonably close with this extraction:

library(stringr)
str_extract_all(convers, 
                paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya"                                        "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff"          "yeah i know thats erm al"                    "hey guys how s it goin"                     
[7] "Great"                                       "where ve you been last week"

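However, the first part of the third speaker's name (al) is included in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?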

Answer 1

Ron*_*hah 5

How about taking a different approach?

Remove all the speakers from the text and split the string on '\\s*:\\s*':

strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]

# [1] ""                                            "hiya"                                       
# [3] "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff"          "yeah i know thats erm"                      
# [7] "hey guys how s it goin"                      "Great"                                      
# [9] "where ve you been last week"                 "ah you know camping with my girl friend"

You can clean up the output a bit to remove the empty first value.
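One possible way to do that cleanup is sketched below, using nzchar() to keep only the non-empty pieces. The variable names (alt, parts, utterances, pattern, via_lookahead) are made up for illustration and are not from the original answer. The second half is an assumed alternative that stays closer to the question's regex idea: a lazy match that starts after ": " and stops right before the next speaker label or the end of the string. It uses base R with perl = TRUE rather than stringr, so treat it as a sketch, not a tested drop-in for str_extract_all().

```r
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
speakers <- c("Peter", "Mary", "al hamshi")

# Alternation of the speaker names, shared by both approaches.
alt <- paste(speakers, collapse = "|")

# Approach from the answer: delete the names, split on the colons,
# then drop the empty string that the leading ":" produces.
parts <- strsplit(gsub(alt, "", convers), "\\s*:\\s*")[[1]]
utterances <- parts[nzchar(parts)]

# Assumed alternative: lazily match everything after ": " up to the
# next "<speaker>:" label or the end of the string.
pattern <- paste0("(?<=: ).+?(?=\\s+(?:", alt, "):|$)")
via_lookahead <- regmatches(convers, gregexpr(pattern, convers, perl = TRUE))[[1]]

length(utterances)
identical(utterances, via_lookahead)
```

Both approaches assume that a speaker's name never occurs inside an utterance. If it can, the gsub() step would delete it from the spoken text, so the lookahead version (which only treats a name followed by a colon as a boundary) may be the safer starting point.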

Archived:	5 years, 2 months ago
Views:	49
Last recorded:	5 years, 2 months ago