Regex in R to match strings in square brackets

Chr*_*ann 4 regex r

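I have transcripts of storytelling with many instances of overlapping speech, where the overlapping speech is enclosed in square brackets. I want to extract these instances of overlap. In the following mock example,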

    ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")

this code works fine:

    pattern <- "\\[(.*\\w.+])*"
    grep(pattern, ovl, value=T)
    matches <- gregexpr(pattern, ovl)
    overlap <- regmatches(ovl, matches)
    overlap_clean <- unlist(overlap); overlap_clean
    [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

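but on a larger file (a data frame) it does not. Is this due to a faulty pattern or to the structure of the data frame? The first six rows of the df look like this: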

    > head(df)
                                                                 Story
    1 "Kar:\tMind you our Colin's getting more like your dad every day
    2                                             June:\tI know he is.
    3                                 Kar:\tblack welding glasses on, 
    4                        \tand he turned round and he made me jump
    5                                                 \t“O:h, Colin”, 
    6                                  \tand then (                  )

Tim*_*sen 5

While it may work in some cases, your pattern looks off to me. I think it should be this:

    pattern <- "(\\[.*?\\])"
    matches <- gregexpr(pattern, ovl)
    overlap <- regmatches(ovl, matches)
    overlap_clean <- unlist(overlap)
    overlap_clean

    [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"


This matches and captures the bracketed terms, using the Perl lazy dot (.*?) to make sure we stop at the first closing bracket.
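To see why the lazy quantifier matters, here is a small sketch comparing a greedy and a lazy pattern on a line that contains two overlaps. The string x is a made-up example, not a line from the transcript.

```r
# Hypothetical line with two bracketed overlaps (not from the transcript)
x <- "well [yes right] and [oh no]"

# Greedy: .* runs on to the LAST closing bracket in the line
greedy <- regmatches(x, gregexpr("\\[.*\\]", x))[[1]]
greedy
# [1] "[yes right] and [oh no]"

# Lazy: .*? stops at the FIRST closing bracket after each opening one
lazy <- regmatches(x, gregexpr("\\[.*?\\]", x))[[1]]
lazy
# [1] "[yes right]" "[oh no]"
```

With only one bracket pair per line, as in ovl, both patterns return the same thing; the difference only shows up once a line has several overlaps.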

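On the data frame part of the question: the post does not show how the pattern was applied to df, so the following is only a sketch of one way to do it. It assumes the text sits in a character column named Story, as the head(df) output suggests, and it reuses a few lines of the mock vector as stand-in data. The point is that gregexpr() and regmatches() are given the column (a character vector) rather than the data frame itself.

```r
# Stand-in data: a few lines from the question's mock vector
ovl <- c("well [yes right]", "let's go",
         "oh [  we::ll] i do n't (0.5) know", "(3.2)")

# Assumed layout: a single character column called Story
df <- data.frame(Story = ovl, stringsAsFactors = FALSE)

pattern <- "\\[.*?\\]"

# Work on the column, not on df as a whole
matches <- gregexpr(pattern, df$Story)
overlap <- regmatches(df$Story, matches)

# One list element per row; rows without brackets yield character(0)
df$n_overlaps <- lengths(overlap)

overlap_clean <- unlist(overlap)
overlap_clean
# [1] "[yes right]" "[  we::ll]"
```

If Story turns out to be a factor rather than character, wrapping it in as.character() before matching keeps regmatches() happy.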