Chr*_*ann 1 regex r regex-lookarounds
I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern, I get reasonably close with this extraction:
library(stringr)
str_extract_all(convers,
paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
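However, the first part of the third speaker's name (al) is included in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are correctly matched and extracted?
What if you take a different approach?
Remove all the speakers from the text and split the string on '\\s*:\\s*':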
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can clean up the output a little by dropping the empty first element.
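One way to do that cleanup, sketched in base R: filter the split result with nzchar(). The second half is a variant that is not part of the original answer. It assumes every speaker tag has the form "name:" and splits on the name together with its colon, so a speaker's name that happens to appear inside an utterance would be left alone. The variable names parts, utterances, speaker_tag, parts2 and utterances2 are made up for illustration.

```r
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
speakers <- c("Peter", "Mary", "al hamshi")

# Approach from the answer: delete the names, split on the colons,
# then drop the empty string produced by the match at the very start.
parts <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers),
                  "\\s*:\\s*")[[1]]
utterances <- parts[nzchar(parts)]

# Variant (assumption: each tag is "<name>:"): split on the name plus
# its colon, so a name mentioned inside an utterance is not removed.
speaker_tag <- paste0("\\s*(?:", paste(speakers, collapse = "|"), "):\\s*")
parts2 <- strsplit(convers, speaker_tag, perl = TRUE)[[1]]
utterances2 <- parts2[nzchar(parts2)]

print(utterances)
```

For this particular string both versions should return the same ten utterances, starting with "hiya" and ending with "ah you know camping with my girl friend". They would differ only if a speaker's name also occurred inside someone's utterance.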