Remove multiple patterns from a text vector in R

vag*_*ond 2 r vector gsub mapply

I want to remove multiple patterns from multiple character vectors. Currently I am doing:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)

and so on.

This is painful. I looked at this question and answer: R: gsub, pattern = vector and replacement = vector, but it did not solve the problem.

Neither mapply nor mgsub worked. I made these vectors:

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"

Mar*_*nar 6

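I know this answer is late to the scene, but it stems from my dislike of having to list the removal patterns by hand inside the grep functions (see the other solutions here). My idea is to set up the patterns beforehand, keep them as a character vector, and paste them together with the regex separator "|" only when they are needed: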

library(stringr)

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))

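Yes, this is effectively the same as some of the other answers here, but I think my solution lets you keep the original "character removal vector", remove, intact.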
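For readers without stringr, the same idea can be sketched in base R with gsub. The sample strings are copied from the question. The trimws call and the keep_hash vector are additions for illustration, not part of the answer above. Note that [[:punct:]] also matches "#", so the original pattern set strips the hashtag marks; keep_hash is a hypothetical variant that leaves them in place, as the question's desired output does.

```r
# Sample data copied from the question
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# Patterns stay in a plain character vector ...
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# ... and are joined with "|" only at the point of use.
# trimws() drops the leading/trailing spaces left behind by the removals.
cleaned <- trimws(gsub(paste(remove, collapse = "|"), "", a.vector))

# Assumption: hashtags should survive, as in the question's desired output.
# keep_hash is a hypothetical variant whose last pattern removes every
# character that is not alphanumeric, whitespace, or "#".
keep_hash <- c("@\\w+", "http\\w+", "[^[:alnum:][:space:]#]")
cleaned_hash <- trimws(gsub(paste(keep_hash, collapse = "|"), "", a.vector))
```

Both pattern vectors remain available unmodified after the call, which is the point of pasting them together only when needed.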


kdo*_*pen 5

Try using |. For example:

> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

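However, this can become a problem if you have a large number of patterns, or if the result of applying one pattern creates a match for another pattern.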
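The second caveat can be illustrated with a toy case. The string and patterns below are made up for the illustration and are not from the question: a single pass with | only sees matches present in the original string, while applying the patterns one after another can remove text that only becomes a match after an earlier removal.

```r
# Hypothetical toy input, chosen so that removing "b" creates "ac"
s <- "abc"
patterns <- c("b", "ac")

# Single pass: "ac" is not contiguous in the original string,
# so only "b" is removed
one_pass <- gsub(paste(patterns, collapse = "|"), "", s)

# Sequential: removing "b" first leaves "ac",
# which the second pattern then removes
sequential <- s
for (p in patterns) sequential <- gsub(p, "", sequential)
```

Which behaviour is wanted depends on the data, so it is worth deciding deliberately rather than treating the two forms as interchangeable.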

Consider creating the remove vector as you suggested and then applying the patterns in a loop:

> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

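Of course, this approach needs to be extended to apply it to a whole table or vector. But if you put it into a function that returns the final string, you should be able to pass it to one of the apply variants.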
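One way that extension might look is sketched below. The function name remove_patterns and the data frame df are hypothetical, introduced only for this illustration. Because gsub is already vectorised over its input, a plain character vector needs no apply call at all; lapply is only needed to cover several columns of a table.

```r
# Hypothetical helper: apply each pattern in turn and return the result
remove_patterns <- function(x, patterns) {
  for (p in patterns) x <- gsub(p, "", x)
  x
}

# Sample data copied from the question
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# A character vector can be passed directly, since gsub is vectorised
cleaned <- remove_patterns(a.vector, remove)

# For a table, restrict the cleaning to the character columns
# (df is a made-up example table)
df <- data.frame(id = 1:2, text = a.vector, stringsAsFactors = FALSE)
is_text <- vapply(df, is.character, logical(1))
df[is_text] <- lapply(df[is_text], remove_patterns, patterns = remove)
```

Leading and trailing spaces left by the removals are kept here, as in the loop above; wrapping the result in trimws would drop them.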