RStudio is not picking the encoding I tell it to use when reading a file

Mau*_*res 18 csv encoding r utf-8 rstudio

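I am trying to read the following UTF-8 encoded file in R, but whenever I read it, the Unicode characters are not decoded correctly (screenshot not included).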

The script I use to process the file is as follows:

defaultEncoding <- "UTF8"
detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                                     "factor", "factor", "factor", "factor", "factor", "numeric", 
                                                     "numeric", "numeric", "numeric", "numeric", "numeric",
                                                     "numeric", "numeric", "numeric", "numeric", "numeric", 
                                                     "numeric", "character", "character")

readDetalheVotacaoMunicipioZona <- function( fileName ) {
  fileConnection = file(fileName,encoding=defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)  
  close(fileConnection)
  contents <- gsub('"', "", contents)

  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos", 
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda", 
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")

  read.csv(text=contents, 
           colClasses=detalheVotacaoMunicipioZonaTypes,
           sep=";", 
           col.names=columnNames, 
           fileEncoding=defaultEncoding,
           header=FALSE)
}

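I read the file, passing UTF-8 as the encoding, remove all the quotes (the quoting is unbalanced, so I need to clean them up), and then feed the contents to read.csv. It reads and processes the file correctly, but it does not seem to use the encoding information I give it.

What should I do to make it read this file as UTF-8?

In case it makes any difference, I am using RStudio on OS X.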

smc*_*mci 14

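This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:

  1. If the problem only occurs in RStudio and not in command-line R, go to RStudio -> Preferences: General, tell us what "Default text encoding:" is set to, click "Change" and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or "Ask" if you always want to be prompted). A screenshot is at the bottom. Let us know which one works!

  2. If the problem also occurs in command-line R, do the following:

Run locale -m on your Mac and tell us whether it supports CP1252, or else ISO8859-1 ('latin1'). Dump the list of supported locales if you need to. (You might as well tell us your macOS version too.)

For each of those two locales, try switching to it: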

# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Make sure not to omit the `"LC_ALL",` first argument, it will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# next try ISO8859-1 (aka 'latin1'), this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# in your program, make sure the Sys.setlocale worked, sprinkle this assertion in your code before attempting to read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")

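That should work. Strictly speaking, the Sys.setlocale() command should go in your ~/.Rprofile so it runs at startup, not inside your R session or source code. However, Sys.setlocale() can fail, so be aware of that. Also, assert Sys.getlocale() in your setup code early and often, as I do. (Really, read.csv should figure out whether the encoding it uses is compatible with the locale, and warn or raise an error if not.)

Let us know which fix works! I am trying to document this more generally, so we can figure out the right enhancement.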
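As a minimal sketch of the "set the locale, then assert it" advice above, the helper below tries a list of candidate locale names in order and stops with an error if none can be set. The name `set_first_locale` is a hypothetical helper, not part of R, and whether any of the candidate names is accepted depends on the operating system. It relies on `Sys.setlocale()` returning an empty string (with a warning) when the requested locale cannot be honoured.

```r
# Hypothetical helper: try each candidate locale in turn and return the
# name of the first one the operating system accepts.
set_first_locale <- function(candidates, category = "LC_ALL") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (and warns) when the request cannot be honoured.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(result)) {
      return(loc)
    }
  }
  stop("none of the candidate locales is available: ",
       paste(candidates, collapse = ", "))
}

# Candidate names taken from the answer above; availability depends on the OS.
chosen <- tryCatch(
  set_first_locale(c("pt_PT.UTF-8", "pt_PT.ISO8859-1", "pt_PT.CP1252")),
  error = function(e) NA_character_
)
print(chosen)                     # NA if none of the candidates could be set
print(Sys.getlocale("LC_CTYPE"))  # what R is actually using now
print(l10n_info())                # reports whether the session is UTF-8 or Latin-1
```

Failing loudly when no candidate works serves the same purpose as the `stopifnot()` assertion: the script never reaches read.csv under a locale it did not expect.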

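  1. Screenshot of the RStudio Preferences "Change default text encoding" menu (screenshot not included).

  • In RStudio 1.0.143 I can't find RStudio -> Preferences: General. There is no "Default text encoding" option under Options > Code. (2 upvotes)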

Pau*_*oso 5

This works fine for me.

Did you try changing/resetting the locale?

In my case it works with

Sys.setlocale(category = "LC_ALL", locale = "Portuguese_Portugal.1252")

d <- read.table(text=readClipboard(), header=TRUE, sep = ';')

head(d)

1  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  20419    20419      ITAPORANGA  33  13 VEREADOR 17157
2  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  20770    20770           MALTA  51  11 PREFEITO  4677
3  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  21091    21091     OLHO D'ÁGUA  32  13 VEREADOR  6653
4  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  21113    21113        OLIVEDOS  23  13 VEREADOR  3243
...
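`readClipboard()` is only available on Windows. As a hedged, portable sketch of the same idea, the lines can be read straight from the file with `readLines(..., encoding = "UTF-8")`, which marks the strings as UTF-8, before stripping the quotes and parsing them. The sample row and temporary file below are made up for illustration, and `read_semicolon_utf8` is a hypothetical name. This is one possible approach, not the method used in either answer.

```r
# Hypothetical helper: read a ';'-separated UTF-8 file whose quoting is
# unbalanced, by stripping every double quote before parsing.
read_semicolon_utf8 <- function(path, ...) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)  # mark as UTF-8
  lines <- gsub('"', "", lines, fixed = TRUE)                 # drop all quotes
  read.csv(text = lines, sep = ";", header = FALSE,
           stringsAsFactors = FALSE, ...)
}

# Illustrative data only: write one made-up row as raw UTF-8 bytes.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines('"25/04/2014";"ELEI\u00c7\u00c3O MUNICIPAL 2012";17157', con,
           useBytes = TRUE)
close(con)

d <- read_semicolon_utf8(tmp)
print(Encoding(d$V2))   # how R has marked the text column
print(validUTF8(d$V2))  # TRUE when the bytes form valid UTF-8
unlink(tmp)
```

Extra arguments such as `col.names` or `colClasses` can be passed through `...` to read.csv.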