Removing duplicate words from a string in R

Question

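This is to help someone who just voluntarily deleted their question after being asked, among other comments, for the code they had tried. Let's assume they tried something like this: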

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and would like to learn a better way. So what is the best way to remove duplicate words from a string?

Answer 1

Edv*_*ss 6

No additional packages are needed.

str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

Atomic function (works on a single string):

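rem_dup.one <- function(x) {
  # Split on spaces and punctuation, but not on apostrophes (keeps "here's" whole)
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  # Lowercase so that "One" and "one" count as the same word, then drop repeats
  paste(unique(tolower(trimws(words))), collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")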

Vectorized:

rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)

Result:

"how do i best try and find a way to improve this code" "and here's a second one not third"
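
One caveat worth sketching: when two separators are adjacent (for example a comma followed by a space), strsplit returns an empty string between them, and that empty token survives unique() and shows up as a double space in the output. The variant below is only an illustration; the name rem_dup.nonempty is hypothetical and not part of the answer above. It filters empty tokens with nzchar() before deduplicating.

```r
# Hypothetical variant of rem_dup.one that drops empty tokens
rem_dup.nonempty <- function(x) {
  # Same split pattern as above: spaces and punctuation, except apostrophes
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  words <- tolower(trimws(words))
  # Remove the "" tokens left between adjacent separators such as ", "
  words <- words[nzchar(words)]
  paste(unique(words), collapse = " ")
}

rem_dup.nonempty("Hello, hello world")
# "hello world"
```

On the two example strings above it should give the same result as rem_dup.one, since neither contains adjacent separators.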

Answer 2

小智 6

To remove duplicate words while leaving any special characters in place, use this function:

rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

Because the input is lowercased first, it treats "Samsung" and "SAMSUNG" as duplicates.
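
If the original capitalization should be preserved, one possible approach is to compare lowercased words while keeping the first occurrence as written. This is a sketch under that assumption, not part of the answer above; the function name rem_dup_keep_case is hypothetical.

```r
# Hypothetical helper: case-insensitive de-duplication that keeps original casing
rem_dup_keep_case <- function(x) {
  words <- unlist(strsplit(x, split = " ", fixed = TRUE))
  # Drop empty tokens produced by repeated spaces
  words <- words[nzchar(words)]
  # duplicated() on the lowercased words flags later repeats regardless of case
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

rem_dup_keep_case("Samsung WA80E5LEC samsung Top Loading")
# "Samsung WA80E5LEC Top Loading"
```

The comparison happens on tolower(words), but the output is built from the untouched words vector, so the first spelling wins.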

Answer 3

cde*_*man 5

If you are still interested in alternative solutions, you can use unique to simplify your code a little.

paste(unique(d), collapse = ' ')
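
Besides being shorter, unique also avoids a pitfall in the d[-which(duplicated(d))] idiom from the question: when a string contains no repeated words, which() returns an empty index, and negating an empty index selects nothing. A small sketch with a made-up input vector:

```r
# Made-up example input with no repeated words
d <- c("no", "repeats", "here")

# which() returns integer(0) here, and d[-integer(0)] selects nothing,
# so every word is dropped
bad <- d[-which(duplicated(d))]

# A logical index keeps all words, and so does unique()
good <- d[!duplicated(d)]

length(bad)                  # 0
identical(good, unique(d))   # TRUE
```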

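As Thomas points out in his comment, you probably do want to remove punctuation. R's gsub supports some convenient built-in character classes, such as [[:punct:]], that you can use instead of spelling out a strict regular expression. Of course, if you need a more refined regular expression, you can always list the specific characters yourself.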

d <- gsub("[[:punct:]]", "", d)
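
Putting the pieces of this answer together on the string from the question gives the following sketch. Note that [[:punct:]] also matches apostrophes, so a word like "here's" would become "heres".

```r
str <- "How do I best try and try and try and find a way to to improve this code?"

# Split into words on single spaces
d <- unlist(strsplit(str, split = " "))

# Strip punctuation from each word, so "code?" becomes "code"
d <- gsub("[[:punct:]]", "", d)

# Keep the first occurrence of each word and rebuild the sentence
res <- paste(unique(d), collapse = " ")
res
# "How do I best try and find a way to improve this code"
```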
