UTF-8 encoding problem in R

efl*_*s89 3 html encoding text r utf-8

I am trying to parse statements from the Mexican Senate, but I am having trouble with the UTF-8 encoding of the web pages.

This HTML comes through cleanly:

library(rvest)
Senate<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")

Here is a sample of the text from that page:

"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora.   Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."

As can be seen, both the accents and the "ñ" come through fine.

The problem occurs with some other HTML pages (from the same domain!). For example:

 Senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")

I get:

 "-EL C. DIPUTADO ADAME ALEMÃÂN: En consecuencia está a discusión la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadora…….."

For this second page, I tried iconv() and also forced the encoding parameter of html() to encoding = "UTF-8", but I keep getting the same result.

I also checked the page's encoding with the W3 Validator; it appears to be UTF-8, with no problems reported.

Using gsub does not seem workable, because different characters are downloaded with the same "code":

í - ÃÂ
á - ÃÂ
ó - ÃÂ
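One possible reading of the identical-looking sequences, offered as an assumption rather than a confirmed diagnosis: if the page's UTF-8 bytes are interpreted as latin1, every accented capital (such as the "Á" in "ALEMÁN") starts with the byte 0xC3, displayed as "Ã", followed by a byte in the 0x80-0x9F range, which latin1 treats as an invisible control character. Different letters would then look the same on screen. A minimal base-R sketch of that effect (the helper name `misread_as_latin1` is made up for illustration):

```r
# Hypothetical helper: take the UTF-8 bytes of a character and decode them
# as if they were latin1, which is the suspected (unconfirmed) mistake.
misread_as_latin1 <- function(ch) {
  bytes <- charToRaw(enc2utf8(ch))
  iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
}

garbled_A <- misread_as_latin1("\u00c1")  # capital A with acute accent
garbled_I <- misread_as_latin1("\u00cd")  # capital I with acute accent

# Code points after the mis-decoding: the first is U+00C3 ("Ã") in both
# cases, the second is a non-printing control character.
print(utf8ToInt(garbled_A))
print(utf8ToInt(garbled_I))

# The visible part is the same even though the original letters differ.
print(utf8ToInt(garbled_A)[1] == utf8ToInt(garbled_I)[1])
```

If this is what is happening, the distinguishing information is still present in the invisible second character, so the text should be recoverable without a lookup table.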

I am pretty much out of fresh ideas.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] grDevices utils     datasets  graphics  stats     grid      methods   base     

other attached packages:
 [1] stringi_0.4-1    magrittr_1.5     selectr_0.2-3    rvest_0.2.0      ggplot2_1.0.0    geosphere_1.3-11 fields_7.1      
 [8] maps_2.3-9       spam_1.0-1       sp_1.0-17        SOAR_0.99-11     data.table_1.9.4 reshape2_1.4.1   xlsx_0.5.7      
[15] xlsxjars_0.6.1   rJava_0.9-6     

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 digest_0.6.8     evaluate_0.5.5   formatR_1.0      gtable_0.1.2    
 [8] httr_0.6.1       knitr_1.8        lattice_0.20-29  MASS_7.3-35      munsell_0.4.2    plotly_0.5.17    plyr_1.8.1      
[15] proto_0.3-10     Rcpp_0.11.3      RCurl_1.95-4.5   RJSONIO_1.3-0    scales_0.2.4     stringr_0.6.2    tools_3.1.2     
[22] XML_3.98-1.1    

UPDATE: This seems to be the problem:

stri_enc_mark(Senate2)
[1] "ASCII"  "latin1" "latin1" "ASCII"  "ASCII"  "latin1" "ASCII"  "ASCII"  "latin1"

...and so on. Apparently, the problem is with the strings marked as latin1:

stri_enc_isutf8(texto2)
    [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

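How can I force the latin1 strings into correct UTF-8 strings? When stringi "translates" them, it seems to get it wrong, giving me the problem described earlier.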
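For reference, one common way to attempt this kind of repair in base R is sketched below. It assumes the garbled strings are UTF-8 text that was decoded once as latin1, which has not been verified against these pages. The function name `repair_mojibake` is hypothetical, and `validUTF8()` needs R 3.3.0 or later, which is newer than the R 3.1.2 session shown above.

```r
# Hypothetical repair: turn each string back into latin1 bytes and, if those
# bytes happen to form valid UTF-8, re-declare them as UTF-8.
repair_mojibake <- function(x) {
  # NA where a string contains characters that latin1 cannot represent
  as_latin1 <- iconv(enc2utf8(x), from = "UTF-8", to = "latin1")

  # validUTF8() inspects the raw bytes, ignoring the declared encoding
  ok <- !is.na(as_latin1) & validUTF8(as_latin1)

  fixed <- as_latin1[ok]
  Encoding(fixed) <- "UTF-8"  # same bytes, now declared as UTF-8
  x[ok] <- fixed
  x
}

# "\u00c3\u00a1" is what a lower-case a-acute looks like after one
# UTF-8-as-latin1 mis-decoding; "\u00e9" is a correctly encoded e-acute.
out <- repair_mojibake(c("abc", "\u00c3\u00a1", "\u00e9"))
print(utf8ToInt(out[2]))  # code point(s) of the repaired string
print(utf8ToInt(out[3]))  # the correct string is left as it was
```

Strings that are already correct are left alone, because a genuinely accented character becomes a lone high byte in latin1, and a lone high byte is not valid UTF-8. If the text was mis-decoded more than once, a single pass would not be enough.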

Dom*_*ois 6

Encodings are one of the 21st century's worst headaches. But here is a solution:

# Set-up remote reading connection, specifying UTF-8 as encoding.
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")

# Read in cycles of 1000 characters
html.text <- c()
i = 0
while(length(html.text) == i) {
    html.text <- append(html.text, readChar(con = read.html.con,nchars = 1000))
    cat(i <- i + 1)
}

# close reading connection
close(read.html.con)

# Paste everything back together and, at the same time, convert from UTF-8
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
# meant to drive us insane.
content <- paste0(iconv(html.text, from="UTF-8", to = "UTF-8"), collapse="")

# Set-up local writing
outpath <- "~/htmlfile.html"

# Create file connection specifying "UTF-8" as encoding, once more
# (Although this one makes sense)
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")

# Use capture.output to dump everything back into the html file
# Using cat inside it will prevent having [1]'s, quotes and such parasites
capture.output(cat(content), file = write.html.con)

# Close the output connection
close(write.html.con)

You can then open the newly created file in your favorite browser. You should see it intact, ready to be re-opened with the tool of your choice!
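Once the page is saved locally as UTF-8, a related base-R technique (not the one used in the code above) is to read the bytes unchanged and only declare them as UTF-8, using the `encoding` argument of `readLines()`. That argument marks the strings and does not re-encode them. The sketch below uses a temporary file with made-up content in place of the downloaded page, so it runs offline:

```r
# Stand-in for the saved page: a temporary file with hypothetical content,
# written as raw UTF-8 bytes so the result does not depend on the locale.
tmp <- tempfile(fileext = ".html")
sample_html <- "<p>ALEM\u00c1N se\u00f1or</p>"
writeBin(charToRaw(enc2utf8(paste0(sample_html, "\n"))), tmp)

# Read the lines as-is and mark them as UTF-8 (no conversion takes place).
txt <- readLines(tmp, encoding = "UTF-8", warn = FALSE)

print(Encoding(txt))
print(identical(utf8ToInt(txt), utf8ToInt(sample_html)))

unlink(tmp)
```

Because nothing is converted to the session's native encoding, this may be worth trying on a Windows-1252 locale like the one in the question, although it has not been tested against the Senate pages.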