You can use regex/grep to check for characters whose hexadecimal values fall outside the printable ASCII range:
x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)
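If you want to treat non-printing ("control") characters as "English", you can extend the character-class range in the first argument to grep so that it starts at "\x01". See ?regex for more on using character classes, and ?Quotes for details on how to specify characters as Unicode, hexadecimal, or octal values.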
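As a minimal sketch of that extension, the snippet below compares the two character classes on a string containing a tab. The variable name with_tab and the choice of a tab as the control character are assumptions made only for illustration.

```r
# Assumption: "\t" (tab, 0x09) stands in for any control character.
with_tab <- "Normal\ttext"

# Printable-only class: the tab lies below \x20, so the string is flagged.
flagged_strict <- grepl("[^\x20-\x7F]", with_tab)

# Extended class starting at \x01: the tab now counts as acceptable.
flagged_extended <- grepl("[^\x01-\x7F]", with_tab)

c(strict = flagged_strict, extended = flagged_extended)
```

Note that the range cannot start at "\x00", because R strings cannot contain an embedded nul character.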
The R.oo package has conversion functions that may be useful:
library(R.oo)
?intToChar
?charToInt
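The fact that Henrik Bengtsson thought it worthwhile to include these in his package suggests to me that there is no convenient way to do this in base/default R. He is a longtime useR/guRu.
Seeing the other answer prompted this attempt, which seems straightforward: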
> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1] TRUE FALSE
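The same idea can be wrapped in a small helper. This is only a sketch: the name has_non_ascii is hypothetical, and it assumes the input is UTF-8 (the call above uses "" for the native encoding instead) and that the platform's iconv returns NA for text it cannot convert.

```r
# Hypothetical helper, not part of base R: TRUE for each element that
# iconv cannot convert to ASCII. A missing (NA) input also yields TRUE.
has_non_ascii <- function(strings) {
  is.na(iconv(strings, from = "UTF-8", to = "ASCII"))
}

# "\u00fc" is the escape for u-umlaut, which keeps this snippet ASCII-only.
flags <- has_non_ascii(c("f\u00fcr", "OrdinaryASCII"))
flags
```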
You can determine whether a string contains non-Latin/non-ASCII characters with iconv and grep:
# My example, because you didn't add your data
characters <- c("ないでさ, satisfação, катынь, Work, Awareness, Potential, für")
# First you convert string to vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
# Then find indices of words with non-ASCII characters using ICONV
characters.non.ASCII <- grep("characters.unlist", iconv(characters.unlist, "latin1", "ASCII", sub="characters.unlist"))
# subset original vector of words to exclude words with non-ASCII characters
data <- characters.unlist[-characters.non.ASCII]
# convert vector back to a string
dat.1 <- paste(data, collapse = ", ")
# Now if you run
characters.non.ASCII
[1] 1 2 3 7
This means that the words at indices 1, 2, 3, and 7 contain non-ASCII characters; in my case, they correspond to ないでさ, satisfação, катынь, and für.
You can also run
dat.1 # and the output will be all ASCII characters
[1] "Work, Awareness, Potential"
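One caveat with the negative-index subset above: if no word is flagged, characters.non.ASCII is integer(0), and subsetting with -integer(0) returns an empty vector rather than all words. A possible variant using a logical mask is sketched below; the names words, non_ascii, and ascii_only are illustrative, and the input is assumed to be UTF-8.

```r
# Assumption: the words are UTF-8; escapes keep this snippet ASCII-only.
words <- c("satisfa\u00e7\u00e3o", "Work", "Awareness", "f\u00fcr")

# TRUE where conversion to ASCII fails (iconv returns NA for that element).
non_ascii <- is.na(iconv(words, from = "UTF-8", to = "ASCII"))

# Logical subsetting keeps every word when nothing is flagged.
ascii_only <- paste(words[!non_ascii], collapse = ", ")
ascii_only
```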