Mla*_*vić 11
There is no definitive way to do this, in Ruby or anywhere else:
str = 'foo' # start with a simple string
# => "foo"
str.encoding
# => #<Encoding:UTF-8> # which is UTF-8 encoded
str.bytes.to_a
# => [102, 111, 111] # as you can see, it consists of three bytes 102, 111 and 111
str.encode!('us-ascii') # now we will re-encode the string as 7-bit US-ASCII
# => "foo"
str.encoding
# => #<Encoding:US-ASCII>
str.bytes.to_a
# => [102, 111, 111] # see, same three bytes
str.encode!('windows-1251') # let us try some cyrillic
# => "foo"
str.encoding
# => #<Encoding:Windows-1251>
str.bytes.to_a
# => [102, 111, 111] # see, the same three again!
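Of course, you can run some statistical analysis on the text and rule out the encodings for which the text is invalid, but in theory this is not a solvable problem.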
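The "rule out encodings for which the text is invalid" step can be sketched roughly as follows. This is only an illustration: `plausible_encodings` and its default candidate list are hypothetical names chosen for this example, and the result is a list of encodings that cannot be excluded, not a detection of the real one.

```ruby
# Hypothetical helper: returns the names of those candidate encodings in which
# the given bytes form a valid byte sequence. A name in the result only means
# "not ruled out"; it does not prove the text was written in that encoding.
def plausible_encodings(bytes, candidates = %w[US-ASCII UTF-8 Windows-1251])
  candidates.select do |name|
    # dup so the caller's string keeps its original encoding tag
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

plausible_encodings("foo".b)
# => ["US-ASCII", "UTF-8", "Windows-1251"]   # nothing can be ruled out
plausible_encodings("\xE4\xF6".b)
# => ["Windows-1251"]                        # US-ASCII and UTF-8 are ruled out
```

Note that plain ASCII text such as "foo" survives every check, which is exactly the ambiguity shown in the session above.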
And*_*iep 10
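For most multibyte encodings, invalid byte sequences can be detected programmatically. Since Ruby treats all strings as UTF-8 by default, you can check whether a string is valid UTF-8: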
# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.valid_encoding?
# => false
str.scrub('').valid_encoding?
# => true
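Furthermore, if a string is not valid UTF-8 but you know its actual character encoding, you can convert it to UTF-8.
Example
Sometimes you end up in a situation where you know that the input file is encoded in either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which of the two it is and convert to UTF-8 if necessary: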
# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
  # not valid UTF-8, so treat the bytes as CP1252 and transcode them to UTF-8
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end
# => "String in CP1252 encoding: äöüß"
=======
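Remarks
Most multibyte encodings, such as UTF-8, can be detected programmatically with fairly high reliability (in Ruby, see #valid_encoding?). After only 16 bytes, the probability that a random byte sequence is valid UTF-8 is only 0.01%. (Compare that with relying on the UTF-8 BOM.)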
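The 16-byte figure can be sanity-checked with a small Monte Carlo experiment. The sketch below is not taken from the answer: `random_valid_utf8_ratio`, the trial count, and the seed are assumptions made for illustration, and the result is an estimate that varies with the seed.

```ruby
# Estimates the fraction of uniformly random byte strings of the given length
# that happen to be valid UTF-8. Name, trial count and seed are illustrative.
def random_valid_utf8_ratio(length: 16, trials: 200_000, seed: 42)
  rng = Random.new(seed)
  hits = trials.times.count do
    # Random#bytes returns a binary string; re-tag it as UTF-8 and validate
    rng.bytes(length).force_encoding(Encoding::UTF_8).valid_encoding?
  end
  hits.fdiv(trials)
end

ratio = random_valid_utf8_ratio
puts format('%.4f%% of random 16-byte strings were valid UTF-8', ratio * 100)
```

With these settings the printed percentage should come out on the order of the 0.01% quoted above. A multibyte sequence cut off at the end of the 16 bytes is counted as invalid here.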
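However, it is not possible to programmatically detect the (in)validity of single-byte encodings such as CP1252 or ISO-8859-1. The snippet above therefore does not work the other way round, that is, it cannot detect whether a String is valid CP1252.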
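A quick way to see why validity checks say nothing for a single-byte encoding is to feed every possible byte value through `valid_encoding?`. This is a minimal sketch; `accepts_every_byte?` is a hypothetical helper name, and ISO-8859-1 is used as the single-byte example.

```ruby
# Hypothetical helper: builds a string containing all 256 byte values and asks
# whether it is a valid byte sequence in the named encoding.
def accepts_every_byte?(encoding_name)
  all_bytes = (0..255).to_a.pack('C*') # binary string "\x00".."\xFF"
  all_bytes.force_encoding(encoding_name).valid_encoding?
end

accepts_every_byte?('ISO-8859-1')
# => true   # every byte is a character, so no input is ever "invalid"
accepts_every_byte?('UTF-8')
# => false  # multibyte structure lets Ruby reject malformed input
```

Because the single-byte check can never fail, it carries no information about whether the text was really written in that encoding.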
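Even though UTF-8 is becoming increasingly popular as the default encoding on the web, CP1252 and other Latin1 flavors are still very common in Western countries, especially in North America. Beware that there are several single-byte encodings that are very similar to CP1252 (a.k.a. Windows-1252) but differ slightly from it. Examples: ISO-8859-1, ISO-8859-15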
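The differences between these look-alike encodings only show up for a handful of byte values. As an illustration (the two bytes and the helper name `decode_as` are picked for this example, not taken from the answer), the same byte decodes to different characters depending on which encoding you assume:

```ruby
# Hypothetical helper: interprets one raw byte string in the given encoding
# and returns the resulting text as UTF-8.
def decode_as(raw, encoding_name)
  raw.b.force_encoding(encoding_name).encode('UTF-8')
end

decode_as("\xA4", 'ISO-8859-1')    # => "¤"  (currency sign)
decode_as("\xA4", 'ISO-8859-15')   # => "€"  (ISO-8859-15 moved the euro sign here)
decode_as("\xA4", 'Windows-1252')  # => "¤"
decode_as("\x80", 'Windows-1252')  # => "€"  (CP1252 puts the euro sign at 0x80)
```

None of these calls raises an error, so a wrong guess between these encodings silently produces wrong characters instead of a failure.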