I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.
Example:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))
Current output (no apostrophe):
[1] "I like to chew gum but dont like bubble gum"
Desired output (I want the apostrophe to stay):
[1] "I like to chew gum but don't like bubble gum"
Kay*_*Kay 38
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)
[1] "I like to chew gum but don't like bubble gum"
The regex above is much more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (note the caret!) with an empty string.
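Since gsub is vectorized, the same negated class can be applied to several strings at once. The following is only a minimal sketch; the `words` vector is made up for illustration and does not come from the question.

```r
# Hypothetical input, not from the original post.
words <- c("don't stop!", "rock & roll...", "it's 5 o'clock?")

# Keep letters, digits, whitespace and apostrophes; drop everything else.
cleaned <- gsub("[^[:alnum:][:space:]']", "", words)
print(cleaned)

# Removing a symbol that sits between two spaces leaves both spaces behind,
# so an optional second pass can collapse runs of whitespace.
collapsed <- gsub("[[:space:]]+", " ", cleaned)
print(collapsed)
```

Whether the second pass is wanted depends on the data; the example string in the question happens not to need it.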
Here is an example:
> gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
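You can exclude the apostrophe from the POSIX class punct by using a double negative (the negated class [:^punct:] is PCRE syntax, hence the perl argument in the code):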
[^'[:^punct:]]
Code:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)
#[1] "I like to chew gum but don't like bubble gum"
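The double negative reads as "not (an apostrophe or a non-punctuation character)", which leaves exactly the punctuation other than the apostrophe. As a rough cross-check, the sketch below compares it with a negative lookahead, another PCRE way to express the same exclusion. The lookahead variant is an alternative added here for comparison, not part of the answer above.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Double negative, as in the answer above.
double_negative <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)

# Alternative (added for comparison): match a punctuation character
# only if the character at that position is not an apostrophe.
lookahead <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)

# Both patterns should remove the same characters from this string.
print(identical(double_negative, lookahead))
```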
Mostly for variety, here is a solution using gsubfn() from the terrific package of the same name. In this application, I simply like how expressive the solution it allows is:
library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
replacement = function(x) ifelse(x == "'", "'", ""),
x)
[1] "I like to chew gum but don't like bubble gum"
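(The engine = "R" argument is needed here; otherwise the default tcl engine is used. Its rules for matching regular expressions are slightly different: for example, if it were used to process the string above, you would need to set pattern = "[[:punct:]$|^]" instead. Thanks to G. Grothendieck for pointing out this detail.)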
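If installing gsubfn is not an option, the same "decide per match" idea can be approximated in base R with gregexpr() and the replacement form of regmatches(). This is only a sketch of that idea under the assumption that base R is preferred; it is not how gsubfn works internally.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Locate every punctuation character in x.
m <- gregexpr("[[:punct:]]", x)

# Work on a copy so the original string stays untouched.
y <- x

# For each match, keep apostrophes and replace everything else with "".
regmatches(y, m) <- lapply(
  regmatches(y, m),
  function(p) ifelse(p == "'", "'", "")
)
print(y)
```

The per-match function is the same one passed as `replacement` in the gsubfn call, so other keep-or-drop rules can be swapped in the same way.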