In Ruby, how do I UTF-8 encode this weird character?

Mor*_*ori 4 ruby encoding utf-8

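I'm importing content from an external database that is infested with all sorts of weird characters, for example:

> str
=> "Natureâ€™s Variety, Best Friends Animal Society team up"

From context, the â€™ is evidently meant to be a right single quotation mark (’). Encoded as cp1252: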

> str.encode('cp1252')
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"

So how do I convert it to the proper UTF-8 character? Here is what I've tried:

> str.encode('UTF-8')
=> "Natureâ€™s Variety, Best Friends Animal Society team up"

> str.encode('cp1252').encode('UTF-8')
=> "Natureâ€™s Variety, Best Friends Animal Society team up"

> str.encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Natureâ€™s Variety, Best Friends Animal Society team up"

> str.encode('cp1252').encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
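A likely reason every attempt above returns the string unchanged: `str` is already valid UTF-8, so `encode('UTF-8')` has nothing to convert, and a round trip through cp1252 lands on the same characters again. A minimal sketch, assuming the string holds the three characters U+00E2, U+20AC and U+2122 where the quote should be (the name `sample` and the shortened text are illustrative, not from the original post):

```ruby
# Shortened stand-in for the imported string: U+00E2, U+20AC, U+2122 in place of the quote.
sample = "Nature\u00E2\u20AC\u2122s Variety"

puts sample.encoding          # the string is already tagged as UTF-8
puts sample.valid_encoding?   # and its bytes form valid UTF-8

# Transcoding to the same encoding, or out to CP1252 and back, changes nothing.
same_encoding = sample.encode('UTF-8')
round_trip    = sample.encode('CP1252').encode('UTF-8')

puts same_encoding == sample
puts round_trip == sample
```

The `invalid:` and `undef:` options do not help here either, since they only apply to bytes or characters that cannot be converted, and everything in this string converts cleanly.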

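I'd prefer to find a generic way to re-encode the string so that it handles all of these mis-encoded characters. If I have to, though, I'll fall back to individual search-and-replace calls. I can't get those to work either: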

> str.encode('cp1252').gsub('\xE2/x80/x99', "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"

> str.encode('cp1252').gsub(%r{\xE2\x80\x99}, "'")
SyntaxError: unexpected tIDENTIFIER, expecting $end

> str.encode('cp1252').gsub(Regexp.escape('\xE2\x80\x99'), "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
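As far as I can tell, the first and third attempts match nothing because single-quoted Ruby strings do not interpret `\x` escapes, so `'\xE2'` is four literal characters rather than one byte (the first attempt also has `/` where `\` was presumably intended). Below is a sketch of a replacement that avoids pasting anything into the REPL by spelling the three characters with `\u` escapes. The variable names and the shortened sample text are illustrative only:

```ruby
# In single quotes the backslash escape is not interpreted.
puts '\xE2'.length     # 4 characters: backslash, x, E, 2
puts "\xE2".bytesize   # 1 byte

# The mis-decoded quote, written with escapes so nothing has to be pasted:
# U+00E2 (a with circumflex), U+20AC (euro sign), U+2122 (trade mark sign).
mojibake_quote = "\u00E2\u20AC\u2122"

sample = "Nature#{mojibake_quote}s Variety"

# U+2019 is the right single quotation mark.
fixed = sample.gsub(mojibake_quote, "\u2019")

puts fixed
```

This only repairs one character sequence, so it is a stopgap rather than the generic re-encoding asked for.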

I'd like to do this, but I can't even paste those characters into my REPL:

> str.gsub('â€™', "'")

When I try, I get:

> str.gsub('C"b,b,b
* "', ",")
=> "Natureâ€™s Variety, Best Friends Animal Society team up"

Frustrating. Any suggestions on how to properly encode this as UTF-8?

EDIT: In response to a request for the actual bytes in the string:

> str.bytes.to_a.join(' ')
=> "78 97 116 117 114 101 195 162 226 130 172 226 132 162 115 32 86 97 114 105 101 116 121 44 32 66 101 115 116 32 70 114 105 101 110 100 115 32 65 110 105 109 97 108 32 83 111 99 105 101 116 121 32 116 101 97 109 32 117 112"
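Decoding the bytes around the apostrophe by hand supports a double-encoding reading. The sketch below takes the bytes 195 162 226 130 172 226 132 162 from the dump above and compares them with what a right single quotation mark (U+2019) turns into when its UTF-8 bytes are misread as CP1252. The variable names are illustrative:

```ruby
# The eight bytes that sit between "Nature" and "s" in the dump.
suspect = [195, 162, 226, 130, 172, 226, 132, 162].pack('C*').force_encoding('UTF-8')

# Print the code points those bytes decode to.
puts suspect.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')

# Reproduce the damage: take the UTF-8 bytes of U+2019, mislabel them as
# CP1252, and transcode the result to UTF-8.
damaged = "\u2019".dup.force_encoding('CP1252').encode('UTF-8')

puts damaged == suspect
```

If the comparison holds, the data was UTF-8 that some earlier step read as CP1252 and then stored as UTF-8 again, which is why reversing exactly that step repairs it.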

Max*_*Max 5

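This is the same problem as fixing incorrectly encoded strings coming out of MySQL. You need to encode the string to the correct encoding (CP1252) and then force the resulting bytes back to UTF-8.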

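# The five bytes that CP1252 leaves undefined. Mapping each matching C1 control
# character straight to its raw byte keeps encode from raising on them.
fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}

# Encode back to the CP1252 bytes, then relabel those bytes as UTF-8.
str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')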

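Depending on your data, the fallback may not be necessary, but it handles the five bytes that are undefined in CP1252 so that the conversion does not raise an error on them.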
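For a generic pass over imported data, the same encode-then-relabel step could be wrapped so that text which is not double-encoded is left alone. This is a minimal sketch, assuming that a failed or invalid conversion means the input was fine to begin with. `repair_mojibake` is a hypothetical helper name, and the fallback hash is omitted for brevity:

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes misread as CP1252".
# Returns the input unchanged when the reversal does not yield valid UTF-8.
def repair_mojibake(text)
  candidate = text.encode('CP1252').force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : text
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  # The text contains something CP1252 cannot represent, so it was not
  # produced by this particular mix-up. Leave it alone.
  text
end

damaged = "Nature\u00E2\u20AC\u2122s Variety"   # U+00E2 U+20AC U+2122 where the quote belongs
puts repair_mojibake(damaged)

# Correctly encoded input passes through untouched.
puts repair_mojibake("caf\u00E9")
puts repair_mojibake("\u65E5\u672C")
```

The `valid_encoding?` check matters because correctly encoded text such as "café" converts to CP1252 without error but yields bytes that are not valid UTF-8. Short strings can still be ambiguous, so spot-check the results on real data before running this in bulk.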