Mau*_*res 18 csv encoding r utf-8 rstudio
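I am trying to read the following UTF-8 encoded file in R, but whenever I read it, the Unicode characters are not decoded correctly:

The script I use to process the file is as follows: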
defaultEncoding <- "UTF8"
detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                      "factor", "factor", "factor", "factor", "factor", "numeric",
                                      "numeric", "numeric", "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric", "numeric", "numeric",
                                      "numeric", "character", "character")
readDetalheVotacaoMunicipioZona <- function( fileName ) {
  fileConnection = file(fileName,encoding=defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)
  close(fileConnection)
  contents <- gsub('"', "", contents)
  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos",
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda",
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")
  read.csv(text=contents,
           colClasses=detalheVotacaoMunicipioZonaTypes,
           sep=";",
           col.names=columnNames,
           fileEncoding=defaultEncoding,
           header=FALSE)
}
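I read the file, which is delivered in UTF-8, strip all the quotation marks (even the numbers are quoted, so I need to clean them up), and then feed the contents to read.csv. It reads and processes the file correctly, but it does not seem to use the encoding information I give it.
What do I need to do to make it read this file as UTF-8?
In case it makes any difference, I am using RStudio on OS X.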
smc*_*mci 14
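This problem is caused by the wrong locale being set, either in RStudio or in command-line R:
If the problem only happens in RStudio and not in command-line R, go to RStudio -> Preferences: General, tell us what "Default text encoding:" is set to, click "Change" and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or "Ask" if you always want to be prompted). A screenshot is attached at the bottom. Let us know which one worked!
If the problem also happens in command-line R, do the following:
Run locale -m on your Mac and tell us whether it supports CP1252 or, failing that, ISO8859-1 ('latin1'). Dump the list of supported locales if you need to. (You might as well tell us your macOS version too.)
For each of those two encodings, try switching to the corresponding locale: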
# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Make sure not to omit the `"LC_ALL",` first argument, it will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'
# next try ISO8859-1(/'latin1'), this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")
# Try "pt_PT.UTF-8" too...
# in your program, make sure the Sys.setlocale worked, sprinkle this assertion in your code before attempting to read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")
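This should work. Strictly speaking, the Sys.setlocale() command should go in your ~/.Rprofile so that it runs at startup, not inside the R session or source code. However, Sys.setlocale() can fail, so be aware of that. Also, assert Sys.getlocale() early and often in your setup code, as I do above. (Really, read.csv should figure out whether the encoding it is using is compatible with the locale, and warn or error if it is not.)
Let us know which fix worked! I am trying to document this more generally, so we can figure out the correct enhancement.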
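Because Sys.setlocale() can fail, it may help to wrap the attempt in a small helper that tries several candidates and stops loudly if none of them can be set. The following is only a sketch: set_first_available_locale is a hypothetical name, it sets LC_CTYPE rather than LC_ALL to keep the check simple, and the trailing "C" entry is there only so the sketch runs on any machine (you would probably drop it in real use).

```r
# Hypothetical helper: try each candidate locale in turn, stop if none can be set.
set_first_available_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (and warns) when the OS cannot honour the request.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(result)) {
      return(invisible(result))
    }
  }
  stop("none of the candidate locales could be set: ",
       paste(candidates, collapse = ", "))
}

# Candidate names follow the ones tried above; availability depends on the OS.
# "C" is a last resort so that this sketch always runs (assumption, not advice).
chosen <- set_first_available_locale(c("pt_PT.UTF-8", "pt_PT.ISO8859-1", "C"))

# Assert that the locale really is what the helper reported.
stopifnot(identical(Sys.getlocale("LC_CTYPE"), chosen))
```

The same call could be placed in ~/.Rprofile, so that a session with an unexpected locale fails at startup instead of silently misreading text later.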
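When reporting back, it can be useful to include what the session itself says about its locale. A minimal sketch, assuming a hypothetical helper named encoding_report; it relies only on base R (Sys.getlocale, l10n_info and R.version):

```r
# Hypothetical helper: gather the locale facts that matter for encoding problems.
encoding_report <- function() {
  info <- l10n_info()
  list(
    r_version = R.version.string,
    platform  = R.version$platform,
    lc_ctype  = Sys.getlocale("LC_CTYPE"),
    utf8      = info[["UTF-8"]],    # TRUE when the session locale is UTF-8
    latin1    = info[["Latin-1"]]   # TRUE when the session locale is Latin-1
  )
}

# Print the report in a compact form suitable for pasting into a reply.
str(encoding_report())
```

Running it once in RStudio and once in command-line R would show whether the two sessions actually differ.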

It works fine for me.

Did you try changing/resetting the locale?

In my case, it works with

Sys.setlocale(category = "LC_ALL", locale = "Portuguese_Portugal.1252")

d <- read.table(text=readClipboard(), header=TRUE, sep = ';')

head(d)

1 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 20419 20419 ITAPORANGA 33 13 VEREADOR 17157
2 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 20770 20770 MALTA 51 11 PREFEITO 4677
3 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 21091 21091 OLHO D'ÁGUA 32 13 VEREADOR 6653
4 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 21113 21113 OLIVEDOS 23 13 VEREADOR 3243
...
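As a cross-check that does not depend on changing the locale, the read-strip-parse pipeline from the question can be sketched with the strings explicitly marked as UTF-8. This is a sketch under assumptions, not a confirmed fix for the original file: read_utf8_csv and the two sample rows are hypothetical, and it relies on the documented behaviour that fileEncoding in read.table applies to a file named by path, not to text= input, while encoding = "UTF-8" in readLines marks strings without re-encoding them.

```r
# Hypothetical sample: two semicolon-separated rows with quoted fields.
sample_lines <- c(
  "\"2012\";\"ELEI\u00c7\u00c3O MUNICIPAL 2012\";\"PB\";\"ITAPORANGA\";\"17157\"",
  "\"2012\";\"ELEI\u00c7\u00c3O MUNICIPAL 2012\";\"PB\";\"OLHO D'\u00c1GUA\";\"6653\""
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path, useBytes = TRUE)  # write the UTF-8 bytes unchanged

# Hypothetical helper mirroring the question: read, strip quotes, then parse.
read_utf8_csv <- function(path) {
  # encoding = "UTF-8" marks the strings as UTF-8; it does not re-encode them.
  lines <- readLines(path, encoding = "UTF-8")
  # Strip every double quote, as the question does.
  lines <- gsub('"', "", lines, fixed = TRUE)
  # With text= input there is no file for fileEncoding to act on, so it is omitted.
  read.csv(text = lines, sep = ";", header = FALSE, colClasses = "character")
}

d <- read_utf8_csv(path)
print(d)
```

If the accented characters print correctly here but not for the real file, the file itself may not be UTF-8, or the console's display encoding may be the remaining issue.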