How to preserve multibyte characters after parse()

Yih*_*Xie 7 windows encoding parsing r

When I parse R code that contains non-native characters under Windows, those characters seem to be turned into their Unicode representations, e.g.

Encoding('ğ')
# [1] "UTF-8"
parse(text="'ğ'")
# expression('<U+011F>')
parse(text="'ğ'", encoding='UTF-8')
# expression('<U+011F>')
deparse(parse(text="'ğ'")[1])
# [1] "expression(\"<U+011F>\")"
eval(parse(text="'ğ'"))
# [1] "<U+011F>"

Since my locale is Simplified Chinese, I can parse code that contains Chinese characters, e.g.

parse(text="'你好'")
# expression('你好')

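My question is: how can I preserve characters such as the letter ğ in this example? Or, at the very least, how can I "reconstruct" the original characters after calling deparse() on the expression?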

My session info:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

The*_*ras 2

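The root of the problem is (quoting the R Installation and Administration manual): “R supports all the character sets that the underlying OS can handle. These are interpreted according to the current locale”. Unfortunately, Windows has no locale that supports UTF-8.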
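To see which situation a given session is in, the locale and its encoding can be inspected directly. This is a minimal sketch; `native_is_utf8` is a hypothetical helper name, not something from the manual. Note that R 4.2.0 and later use UTF-8 as the native encoding on recent Windows 10 builds, so the check may well return TRUE there, unlike on the R 2.15.2 setup in the question.

```r
# Hypothetical helper: TRUE when the session's native encoding is UTF-8.
native_is_utf8 <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

Sys.getlocale("LC_CTYPE")   # the locale whose character set the parser works in
native_is_utf8()            # FALSE in a CP936, CP1253 or CP1254 session
Encoding("\u011f")          # the string itself is still marked as UTF-8
```

When `native_is_utf8()` is FALSE and the character is not in the locale's code page, the parser cannot represent it natively, which is where the `<U+011F>` form comes from.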


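Now, the good news is that Rgui apparently does support UTF-8 (scroll down to 2.7.0 > Internationalization). However, the R parser only works with characters that are supported in the current locale. So the solution that worked for me is to temporarily change the R locale with Sys.setlocale() for the parsing step, and then use iconv() to convert the parsed result to UTF-8: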

> Sys.getlocale()
[1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
> orig.locale <- Sys.getlocale("LC_CTYPE")
> parse(text="'你好'")
expression('<U+4F60><U+597D>')
> Sys.setlocale(locale="Chinese")
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
> a <- parse(text="'你好'")
> a
expression('你好')
> Sys.setlocale(locale="Turkish")
[1] "LC_COLLATE=Turkish_Turkey.1254;LC_CTYPE=Turkish_Turkey.1254;LC_MONETARY=Turkish_Turkey.1254;LC_NUMERIC=C;LC_TIME=Turkish_Turkey.1254"
> b <- parse(text="'ğ'")
> b
expression('ğ')
> Sys.setlocale(locale=orig.locale)
[1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
> a
[1] expression('ΔγΊΓ')
> b
[1] expression('π')
> ai <- iconv(a, from="CP936", to="UTF-8")
> ai
[1] "你好"
> bi <- iconv(b, from="CP1254", to="UTF-8")
> bi
[1] "ğ"
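The same switch, parse, convert sequence can be wrapped in a function so that the original locale is restored even when parsing fails. This is only a sketch under assumptions: `parse_in_locale` is a hypothetical name, it converts with `enc2utf8()` while the temporary locale is still active instead of calling `iconv()` afterwards with an explicit code page, and it returns deparsed code as character strings, not an expression object.

```r
# Hypothetical helper (not part of the session above): parse `text` while
# LC_CTYPE is temporarily set to `locale`, and return the deparsed code as
# UTF-8 strings, one per parsed expression.
parse_in_locale <- function(text, locale) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the caller's locale on exit, including when parse() errors.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)

  # Sys.setlocale() returns "" (with a warning) if the request is refused.
  if (!nzchar(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)))) {
    stop("locale not available: ", locale)
  }

  exprs <- parse(text = text)
  # Deparse while the temporary locale is active, then convert the
  # native-encoded text to UTF-8 before the locale is switched back.
  vapply(exprs,
         function(e) enc2utf8(paste(deparse(e), collapse = "\n")),
         character(1))
}
```

On a Windows setup like the one above, a call might look like `parse_in_locale("'\u011f'", "Turkish")`. The locale name is platform specific, and only characters present in that locale's code page can survive the round trip.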
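For the second half of the question (rebuilding the original characters after deparse()), one option that avoids touching the locale is to post-process the `<U+XXXX>` escapes in the deparsed text. The sketch below assumes the escapes always have the form `<U+` followed by 4 to 8 hex digits; `unescape_unicode` is a hypothetical helper, and it would also rewrite any literal text in the code that happens to look like such an escape.

```r
# Hypothetical helper: replace "<U+XXXX>" escapes with the UTF-8 characters
# they stand for. `x` is a character vector, e.g. the output of deparse().
unescape_unicode <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    hex <- gsub("^<U\\+|>$", "", esc)          # keep only the hex digits
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

fixed <- unescape_unicode("expression(\"<U+011F>\")")
Encoding(fixed)   # the rebuilt string is marked as UTF-8
```

Whether the rebuilt character displays correctly still depends on the console: Rgui can show it, while a terminal limited to the locale's code page may print the escape again.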

Hope this helps!
