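I am new to R, so I hope you can help me.
I want to use gsub to remove all punctuation except periods and minus signs, so that I can keep decimal points and negative signs in my data.
Example
My data frame z has the following data: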
     [,1] [,2]
[1,] "1"  "6"
[2,] "2@" "7.235"
[3,] "3"  "8"
[4,] "4"  "$9"
[5,] "£5" "-10"
I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.
Current output
> gsub("[[:punct:]]", "", z)
     [,1] [,2]
[1,] "1"  "6"
[2,] "2"  "7235"
[3,] "3"  "8"
[4,] "4"  "9"
[5,] "5"  "10"
However, I would like to keep the "-" sign and the "." sign.
Desired output
PSEUDO CODE:
> gsub("[[:punct:]]", "", z, except(".", "-") )
     [,1] [,2]
[1,] "1"  "6"
[2,] "2"  "7.235"
[3,] "3"  "8"
[4,] "4"  "9"
[5,] "5"  "-10"
Any idea how to exempt some characters from the gsub() function?
ags*_*udy 13
You can put back some of the matches like this:
sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
     X..1. X..2.
[1,] "1"   "6"
[2,] "2"   "7.235"
[3,] "3"   "8"
[4,] "4"   "9"
[5,] "5"   "-10"
Here I am keeping . and - by capturing them and putting them back with \\1.
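As a small sketch of how the capture-group trick behaves, the snippet below compares sub and gsub on a few made-up strings (the vector x and both result names are illustrative and not from the original post). Since sub replaces only the first match in each element, a value with several unwanted characters may need gsub instead.

```r
# Hypothetical sample values; the last one has several punctuation characters
x <- c("2@", "7.235", "$9", "-10", "-$1,234.50")

# Group 1 captures "." or "-" and "\\1" writes it back;
# any other punctuation leaves group 1 empty and is dropped
first_only  <- sub("([.-])|[[:punct:]]", "\\1", x)   # first match per element only
all_matches <- gsub("([.-])|[[:punct:]]", "\\1", x)  # every match

print(first_only)
print(all_matches)
```

For the question's data both calls give the same result, because each value contains at most one character that has to go.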
I guess the next step is to coerce the result to a numeric matrix, so here I combine the two steps like this:
matrix(as.numeric(sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
     [,1]    [,2]
[1,]    1   6.000
[2,]    2   7.235
[3,]    3   8.000
[4,]    4   9.000
[5,]    5 -10.000
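If z really is a data frame and should stay one, a possible variant is to clean and convert column by column instead of going through as.matrix. This is only a sketch: the column names a and b and the sample values are assumptions, and gsub is used in place of sub so that every match in a value is handled.

```r
# Hypothetical data frame standing in for z (ASCII-only sample values)
z <- data.frame(a = c("1", "2@", "3", "4", "#5"),
                b = c("6", "7.235", "8", "$9", "-10"),
                stringsAsFactors = FALSE)

# Clean each column, convert it to numeric, and keep the data frame structure
z[] <- lapply(z, function(col) {
  as.numeric(gsub("([.-])|[[:punct:]]", "\\1", col))
})

str(z)
```

Assigning into z[] keeps the column names and the data.frame class, whereas matrix(..., ncol=2) needs the number of columns to be given by hand.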
You can try this code. I find it handy.
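x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:]\\-\\.\\s]", "", x, perl = TRUE)
[1] "6345" "7.235" "8" "9" "-10"
x <- c('1', '2@', '3', '4', '£5')
gsub("[^[:alnum:]\\-\\.\\s]", "", x, perl = TRUE)
[1] "1" "2" "3" "4" "5"
The code gsub("[^[:alnum:]]", "", x) removes everything that is not alphanumeric. We then add to the list of exceptions: here a hyphen (\-), a full stop (\.) and whitespace (\s), giving gsub("[^[:alnum:]\\-\\.\\s]", "", x, perl = TRUE). It now removes everything that is not alphanumeric, a hyphen, a full stop or whitespace. perl = TRUE is used because R's regex documentation describes the backslash-escaped hyphen inside a bracket expression only for the Perl-compatible engine; with the default engine, put the hyphen first or last in the brackets instead.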
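For the default regex engine (no perl = TRUE), a variant of the same negated-class idea that avoids backslash escapes inside the brackets might look like the sketch below. The sample vector x is made up for illustration; the hyphen is placed last so it is read literally, and [:space:] is used for whitespace.

```r
# Hypothetical sample strings containing letters, spaces and punctuation
x <- c("abc-def, ghi.", "$1 000.5")

# Keep alphanumerics, whitespace, "." and "-"; remove everything else.
# The trailing "-" is a literal hyphen, not a range operator.
kept <- gsub("[^[:alnum:][:space:].-]", "", x)

print(kept)
```

Any further character to keep can be added inside the brackets, as long as the hyphen stays at the end.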
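Here are some options to restrict a generic character class in R, using both base R (g)sub and the stringr remove/replace functions.
(g)sub with perl=TRUE
You can use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with a (?!\.) negative lookahead, which requires that the character immediately to the right is not a dot: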
(?!\.)[[:punct:]] # Excluding a dot only
(?![.-])[[:punct:]] # Excluding a dot and hyphen
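Applied to values like those in the question, the dot-and-hyphen lookahead could be used as in this short sketch (the vector x is an assumed sample, not the asker's actual data). perl = TRUE is required because lookaheads are not available in R's default regex engine.

```r
# Hypothetical sample values similar to the question's data
x <- c("2@", "7.235", "$9", "-10", "1,000.5")

# The lookahead rejects "." and "-" before [[:punct:]] gets to consume them
res <- gsub("(?![.-])[[:punct:]]", "", x, perl = TRUE)

print(res)
# [1] "2"      "7.235"  "9"      "-10"    "1000.5"
```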
To match one or more occurrences, wrap the pattern in a non-capturing group and apply the + quantifier to that group:
(?:(?!\.)[[:punct:]])+ # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen
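Note that both expressions give the same result when you remove the matches. When you need to replace them with another string or character, however, the quantified version replaces each whole run of consecutive characters with a single replacement.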
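With stringr replace/remove functions
Before going into details, keep in mind that the PCRE [[:punct:]] used with (g)sub does not match the same characters as it does in the stringr regex functions, which are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead; see "R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?"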
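The difference can be checked with a quick comparison like the one below. It is a sketch that assumes the stringr package is installed (the guard skips that part otherwise), and the sample string and variable names are made up. No output is shown for the ICU calls; per the linked question, ICU's [[:punct:]] is expected to leave symbols such as + in place.

```r
x <- "a+b=c!"

# Base R with PCRE: +, = and ! are all treated as punctuation
pcre_res <- gsub("[[:punct:]]", "", x, perl = TRUE)
print(pcre_res)  # [1] "abc"

# stringr (ICU): compare [[:punct:]] with [\p{P}\p{S}]
if (requireNamespace("stringr", quietly = TRUE)) {
  icu_punct <- stringr::str_remove_all(x, "[[:punct:]]")
  icu_ps    <- stringr::str_remove_all(x, "[\\p{P}\\p{S}]")
  print(icu_punct)
  print(icu_ps)
}
```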
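The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.
So you write your character class, for example an all-punctuation class such as [\p{P}\p{S}], and then "exclude" (= subtract) one character, or two or three characters, or a whole subclass of characters. You can use two notations, intersection with a negated class (&&[^...]) or set difference (--[...]):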
[\p{P}\p{S}&&[^.]] # Excluding a dot
[\p{P}\p{S}--[.]] # Excluding a dot
[\p{P}\p{S}&&[^.-]] # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]] # Excluding a dot and hyphen
To match 1+ consecutive occurrences with this approach, you do not need any wrapping group; just append +:
[\p{P}\p{S}&&[^.]]+ # Excluding a dot
[\p{P}\p{S}--[.]]+ # Excluding a dot
[\p{P}\p{S}&&[^.-]]+ # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+ # Excluding a dot and hyphen
See the R demo with output:
x <- "Abc.123#&*xxx(x-y-z)???? some@other!chars."
gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~") # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."