How to check whether a character is UTF-8

log*_*han 8 ruby ruby-on-rails

How can I check in Ruby / RoR whether a string is UTF-8 encoded?

Mla*_*vić 11

There is no definitive way to do this, in Ruby or anywhere else:

str = 'foo' # start with a simple string
# => "foo" 
str.encoding
# => #<Encoding:UTF-8> # which is UTF-8 encoded
str.bytes.to_a
# => [102, 111, 111] # as you can see, it consists of three bytes 102, 111 and 111
str.encode!('us-ascii') # now we will recode the string to 7-bit US-ASCII encoding
# => "foo" 
str.encoding
# => #<Encoding:US-ASCII> 
str.bytes.to_a
# => [102, 111, 111] # see, same three bytes
str.encode!('windows-1251') # let us try some cyrillic
# => "foo" 
str.encoding
# => #<Encoding:Windows-1251> 
str.bytes.to_a
# => [102, 111, 111] # see, the same three again!

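Of course, you can run some statistical analysis on the text and rule out the encodings for which the text is invalid, but in theory this is not a solvable problem.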
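The "rule out the encodings for which the text is invalid" idea can be sketched roughly as follows. The helper name `plausible_encodings` and the candidate list are made up for this example. It only eliminates impossible encodings; it cannot tell you which of the remaining ones is the right one.

```ruby
# Candidate encodings to test against; an arbitrary list chosen for illustration.
CANDIDATES = [Encoding::UTF_8, Encoding::US_ASCII, Encoding::Windows_1251].freeze

# Hypothetical helper: returns the names of the candidate encodings under
# which the raw bytes of +data+ form a structurally valid string.
def plausible_encodings(data, candidates = CANDIDATES)
  candidates
    .select { |enc| data.dup.force_encoding(enc).valid_encoding? }
    .map(&:name)
end

ascii_only = plausible_encodings("foo".b)
# => ["UTF-8", "US-ASCII", "Windows-1251"]  (nothing can be ruled out)

high_byte = plausible_encodings("\xE4".b)
# => ["Windows-1251"]  (a lone 0xE4 is neither valid UTF-8 nor US-ASCII)
```

Note that for plain ASCII input every candidate survives, which is exactly the ambiguity shown with `'foo'` above.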

  • What about [String#valid_encoding?](http://ruby-doc.org/core-2.1.0/String.html#method-i-valid_encoding-3F)? Example: `"Partly valid\xE4 UTF-8 encoding: äöüß".valid_encoding?` (2 upvotes)

And*_*iep 10

Checking UTF-8 validity

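For most multi-byte encodings, invalid byte sequences can be detected programmatically. Since Ruby treats all strings as UTF-8 by default, you can check whether a string is valid UTF-8: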

# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"

str.valid_encoding?
   # => false

str.scrub('').valid_encoding?
   # => true
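One caveat worth sketching: `valid_encoding?` checks the bytes against whatever encoding the string is currently tagged with. Data that arrives tagged as binary (ASCII-8BIT) always reports as valid, so it has to be retagged as UTF-8 before the check means anything. The helper name `valid_utf8?` below is an assumption for illustration, not part of Ruby.

```ruby
# Hypothetical helper: checks whether the raw bytes would be valid UTF-8,
# regardless of the encoding the string is currently tagged with.
# Works on a copy so the caller's string keeps its original tag.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

raw = "caf\xC3\xA9".b        # "café" as UTF-8 bytes, tagged as binary
raw.valid_encoding?          # => true, but only because any bytes are valid binary
valid_utf8?(raw)             # => true
valid_utf8?("caf\xE9".b)     # => false (0xE9 starts a sequence that is never completed)
```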

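Converting encodings

Furthermore, if a string is not valid UTF-8 but you know its actual character encoding, you can convert the string to UTF-8.

Example
Sometimes you end up in a situation where you know that the input file is encoded in either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which encoding it is and convert to UTF-8 (if necessary):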

# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

str  = File.read( 'input_file' )

unless str.valid_encoding?
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
   # => "String in CP1252 encoding: äöüß"
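The same "UTF-8 if valid, otherwise assume CP1252" logic can be wrapped in a small function that works on raw bytes. This is only a sketch: the name `to_utf8` and the `fallback:` keyword are invented here, and it relies on the assumption that the input really is one of those two encodings.

```ruby
# Hypothetical helper: returns a UTF-8 string for bytes that are either
# already UTF-8 or encoded in +fallback+ (CP1252 by default).
def to_utf8(bytes, fallback: 'CP1252')
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: reinterpret the bytes as the fallback encoding and
  # transcode, replacing anything that cannot be mapped with '?'.
  bytes.encode(Encoding::UTF_8, fallback,
               invalid: :replace, undef: :replace, replace: '?')
end

from_utf8   = to_utf8("gr\xC3\xBC\xC3\x9F".b)  # already UTF-8, returned as "grüß"
from_cp1252 = to_utf8("gr\xFC\xDF".b)          # CP1252 bytes, converted to "grüß"
```

Both calls yield the same UTF-8 string, and the argument itself is never modified.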

=======
Remarks

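  • Most multi-byte encodings such as UTF-8 can be detected programmatically (in Ruby, see: #valid_encoding?) with high reliability. After only 16 bytes, the probability that a random byte sequence is valid UTF-8 is only 0.01%. (Compare that with relying on the UTF-8 BOM.)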

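  • However, it is NOT possible to programmatically detect the (in)validity of single-byte encodings such as CP1252 or ISO-8859-1. The code snippet above therefore does not work the other way around, i.e. for detecting whether a String is valid CP1252.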

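  • Even though UTF-8 is becoming more and more popular as the default encoding on the web, CP1252 and other Latin1 flavors are still very common in Western countries, especially in North America. Be aware that there are several single-byte encodings that are very similar to, but slightly different from, CP1252 (aka Windows-1252). Examples: ISO-8859-1, ISO-8859-15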
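The last two remarks can be seen in a short sketch: the same three bytes pass `valid_encoding?` under every one of these single-byte encodings, yet they decode to different text. The byte values are picked only for illustration (0x80 and 0xA4 are two of the positions where these encodings differ).

```ruby
bytes = "\x80\xA4\xE4".b   # three arbitrary high bytes, tagged as binary

results = {}
%w[CP1252 ISO-8859-1 ISO-8859-15].each do |name|
  # Structural validity of the bytes when tagged with this encoding
  valid   = bytes.dup.force_encoding(name).valid_encoding?
  # The text those bytes represent, transcoded to UTF-8
  decoded = bytes.encode('UTF-8', name)
  results[name] = [valid, decoded]
  puts "#{name}: valid=#{valid} text=#{decoded.inspect}"
end
```

All three report as valid, so validity alone cannot pick the right one. 0x80 is the euro sign in CP1252 but a control character in both ISO variants, while 0xA4 is the euro sign only in ISO-8859-15.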