Remove all punctuation except apostrophes in R

Tyl*_*ker 30 r

I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.

Example:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))

Current output (the apostrophe is gone)

[1] "I like to chew gum but dont like bubble gum"

Desired output (I want the apostrophe to stay)

[1] "I like to chew gum but don't like bubble gum"

Kay*_*Kay 38

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)

[1] "I like to chew gum but don't like bubble gum"

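The regular expression above is much more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (the caret negates the class!) with an empty string.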
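As a rough sketch of how the negated class behaves, the snippet below wraps the same pattern in a helper and applies it to a small vector. The function name `keep_apostrophes` and the sample strings are made up for illustration and are not part of the answer.

```r
# Hypothetical helper (the name is illustrative): delete every character that
# is not alphanumeric, whitespace, or an apostrophe.
keep_apostrophes <- function(text) {
  gsub("[^[:alnum:][:space:]']", "", text)
}

samples <- c("don't!", "rock & roll", "it's (really) 5 o'clock?")
cleaned <- keep_apostrophes(samples)
print(cleaned)
```

Because characters are deleted rather than replaced by a space, `rock & roll` ends up with two consecutive spaces; collapsing them would need a separate step.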

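  • +1 - In my opinion, the idea here points to the cleanest solution. Just edit the second line to read `gsub("[^[:alnum:][:space:]']", "", x)` and it's golden. (FWIW, no backslashes are needed in the regex.) (3 upvotes)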

koh*_*ske 7

Here is an example:

>  gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"

  • In the end, this would be the simplest form: `gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x)`. (2 upvotes)

Mar*_*ano 6

You can exclude the apostrophe from the POSIX class punct by using a double negative:

[^'[:^punct:]]

Code:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)

#[1] "I like to chew gum but don't like bubble gum"

ideone demo
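To spell out the double negative: `[:^punct:]` is the PCRE negation of `[:punct:]`, so the outer `[^...]` matches any character that is neither an apostrophe nor a non-punctuation character, which leaves punctuation other than the apostrophe. The sketch below compares it with a negative lookahead on the question's string; the lookahead variant is an alternative added here for comparison and is not from the answer above.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Double negative: match what is NOT (an apostrophe or a non-punctuation char).
double_negative <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)

# Alternative for comparison (not from the answer): the lookahead (?!') refuses
# to match at an apostrophe, then [[:punct:]] consumes one punctuation char.
lookahead <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)

print(double_negative)
print(identical(double_negative, lookahead))
```

Both patterns need `perl = TRUE`, since they rely on PCRE syntax.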


Jos*_*ien 5

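Mostly for variety, here is a solution using gsubfn() from the excellent package of the same name. In this application, I just like how nicely expressive the solution it allows is: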

library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
       replacement = function(x) ifelse(x == "'", "'", ""), 
       x)
[1] "I like to chew gum but don't like bubble gum"

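(The argument engine = "R" is needed here; otherwise the default tcl engine is used. Its rules for matching regular expressions are slightly different: if it were used to process the string above, for example, you would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)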
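For readers without gsubfn installed, the same "decide per match" idea can be roughly approximated in base R with `gregexpr()` and `regmatches<-`. This is only a sketch under that assumption, not the answer's method, and the variable names are illustrative.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Locate every punctuation character in the string.
matches <- gregexpr("[[:punct:]]", x)

# Build the replacements: keep apostrophes, blank out everything else.
replacements <- lapply(regmatches(x, matches), function(p) {
  p[p != "'"] <- ""
  p
})

# Write the replacements back into a copy of the original string.
result <- x
regmatches(result, matches) <- replacements
print(result)
```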

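  • One caveat - for some reason, the character class `[:punct:]`, when used in the `pattern` argument of a `gsubfn()` call, does not match `$`, `|`, or `^` the way it does in a call to `gsub()`. Hence I had to add them "by hand". (2 upvotes)
  • `gsubfn` uses tcl regular expressions by default. If you want to use R regular expressions, use the argument `engine = "R"`. (2 upvotes)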