Read a file in R with mixed character encodings

Question

I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.

So, in R, how can I try several encodings for each line of an input file? In Python 3, I'd do something like:

with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')


Answer 1

Kod*_*ist 5

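There do seem to be R library functions for guessing character encodings, such as stringi::stri_enc_detect, but where possible it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. The best way to do this seems to be to take advantage of the fact that when iconv fails to convert a string, it returns NA.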

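# Decode each line separately, trying encodings in a fixed order.
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = FALSE, function(line) {
        # Valid UTF-8 (which includes plain ASCII) is kept as-is.
        if (validUTF8(line))
            return(line)
        # iconv returns NA when it cannot convert the line.
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })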

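The same try-in-order idea can also be written with the fallback encodings passed in as a vector, which makes the list easier to reorder or extend. This is only a sketch: decode_line, decode_lines, and the encodings argument are hypothetical names that are not part of the answer, and which encoding names iconv accepts depends on the platform (see iconvlist()).

```r
# Hypothetical generalisation: try each fallback encoding in order.
decode_line <- function(line, encodings = c("Windows-1252", "Shift-JIS")) {
  # Valid UTF-8 (including plain ASCII) needs no conversion.
  if (validUTF8(line)) return(line)
  for (enc in encodings) {
    out <- iconv(line, enc, "UTF-8")
    # iconv gives NA when the line cannot be converted from `enc`.
    if (!is.na(out)) return(out)
  }
  stop("Encoding not detected")
}

# Apply decode_line to every line of a file (or any path readLines accepts).
decode_lines <- function(path, encodings = c("Windows-1252", "Shift-JIS")) {
  vapply(readLines(path, warn = FALSE), decode_line, character(1),
         encodings = encodings, USE.NAMES = FALSE)
}
```

Order matters here: a single-byte encoding such as Windows-1252 accepts nearly any byte sequence, so encodings listed after it are reached only for lines containing one of the few bytes it leaves undefined.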

If you create a test file, inptest, with one line in each of ASCII, UTF-8, Windows-1252, and Shift-JIS, then linewise.decode("inptest") does indeed return

[1] "This line is ASCII"                    
[2] "This line is UTF-8: I like ?"          
[3] "This line is Windows-1252: Müller"     
[4] "This line is Shift-JIS: ???????"

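For reference, a test file of this shape could be written from R itself. This is a sketch rather than the command originally used: write_test_file is a hypothetical helper, and the sample strings, including the non-ASCII characters in the UTF-8 and Shift-JIS lines, are assumptions chosen for illustration.

```r
# Hypothetical helper: write one line per encoding as raw bytes.
write_test_file <- function(path) {
  # Names are the target encodings; the sample text is assumed, not original.
  samples <- c(
    "ASCII"        = "This line is ASCII",
    "UTF-8"        = "This line is UTF-8: I like \u03c0",
    "Windows-1252" = "This line is Windows-1252: M\u00fcller",
    "Shift-JIS"    = "This line is Shift-JIS: \u30cf\u30ed\u30fc"
  )
  con <- file(path, "wb")
  on.exit(close(con))
  for (enc in names(samples)) {
    # toRaw = TRUE returns the converted bytes instead of a string.
    bytes <- iconv(samples[[enc]], "UTF-8", enc, toRaw = TRUE)[[1]]
    # Terminate each line with a single LF byte.
    writeBin(c(bytes, as.raw(0x0a)), con)
  }
  invisible(path)
}

write_test_file("inptest")
```

The file is written in binary mode so that R does not re-encode any of the lines on the way out.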

To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).
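A slightly more explicit variation is to join the decoded lines into one string and tell the parser that it is UTF-8 text. This is a sketch that assumes the XML package is installed; tables_from_lines is a hypothetical name, and its input is expected to be lines that have already been decoded to UTF-8, for example by linewise.decode.

```r
# Hypothetical wrapper: parse already-decoded UTF-8 lines and extract tables.
tables_from_lines <- function(lines) {
  html <- paste(lines, collapse = "\n")
  # asText = TRUE marks `html` as content rather than a file name or URL.
  doc <- XML::htmlParse(html, asText = TRUE, encoding = "UTF-8")
  XML::readHTMLTable(doc)
}
```

It could then be called as tables_from_lines(linewise.decode("http://example.com")), which returns a list with one data frame per table found.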
