How to remove duplicate words inside a pattern from a string in R

Question

My goal is to remove duplicate words only inside the parentheses in a set of strings.

a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )

This is the result I want:

a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) (sure|surely) you know (what|when) (you|her) should (do|)'

To get this result, I used the following code:

a = gsub('\\|', " | ",  a)
a = gsub('\\(', "(  ",  a)
a = gsub('\\)', "  )",  a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))

However, it produced the wrong output:

a    
[1] "I (  have | has ) certain words word worded"                 
[2] "(  You | Youre ) can cans do this works worked"              
[3] "I (  am | are ) sure surely you know what when her should do"

Why does my code remove the parentheses in the latter part of each string? How can I get the result I want?

Answer 1

akr*_*run 5

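We can use gsubfn. The idea is to match an opening parenthesis (written \\( because the parenthesis is a regex metacharacter and must be escaped), followed by one or more characters that are not a closing parenthesis ([^)]+), and to capture those characters as a group. In the replacement function, we split the captured text (x) with strsplit, unlist the resulting list, keep the unique elements, and paste them back together with |. The closing parenthesis is not part of the match, so it is left in place.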

library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x, 
                "[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"                   
#[2] "(You|Youre) (can|cans) do this (works|worked)"                      
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
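
As for why the code in the question loses the later parentheses: after padding, strsplit(a, " ") turns every "(", "|" and ")" into its own token, and unique() then runs over all tokens of the whole string, so only the first occurrence of each survives (the second "certain" is dropped for the same reason). The deduplication has to be applied per group instead. A minimal base R sketch of that idea, assuming the groups are never nested, is shown below; it is an alternative to the gsubfn call above, not part of the original answer, and dedupe_groups is a hypothetical helper name.

```r
# Input strings from the question.
a <- c("I (have|has|have) certain (words|word|worded|word) certain",
       "(You|You|Youre) (can|cans|can) do this (works|works|worked)",
       "I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)")

# Hypothetical helper: deduplicate the "|"-separated alternatives
# inside every "(...)" group, leaving text outside the groups untouched.
dedupe_groups <- function(s) {
  # Locate every parenthesised group (assumes no nested parentheses).
  m <- gregexpr("\\([^()]*\\)", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(groups) {
    vapply(groups, function(g) {
      inner <- substr(g, 2, nchar(g) - 1)                      # drop "(" and ")"
      alts  <- unique(strsplit(inner, "|", fixed = TRUE)[[1]]) # split on literal "|"
      paste0("(", paste(alts, collapse = "|"), ")")
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

result <- dedupe_groups(a)
print(result)
```

Because unique() is called once per group, words repeated outside the parentheses, or in different groups, are not affected.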
