Why can the UTF-8 BOM bytes EF BB BF be replaced by \uFEFF?

aar*_*chu 6 java byte-order-mark

The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").

Many solutions in Java are just a simple one-liner:

 replace("\uFEFF", "")

I don't understand why this works.

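Here is my test code. I inspected the bytes after calling String#replace and found that EF BB BF had indeed been removed. See this code run live at IdeOne.com.

It seems like magic. Why does this work?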

@Test
public void removeBom() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // read() returns how many chars were decoded; the rest of c stays '\0'
    int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
    // getBytes() without an argument uses the platform default charset
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes();
    for (byte bt : bytes) { // 61 61 61, we can see EF BB BF is indeed removed
        System.out.println(bt);
    }
}

Sub*_*mal 6

The reason is that Unicode text may start with a byte order mark (for UTF-8 it is permitted, but neither required nor recommended [1]).

from Wikipedia

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...

Which means this special character (\uFEFF) must also be encoded in UTF-8.

UTF-8 can encode Unicode code points in one to four bytes.

  • code points which can be represented with 7 bits are encoded in one byte, the highest bit is always zero 0xxx xxxx
  • all other code points are encoded in multiple bytes, depending on the number of bits needed; the number of leading 1 bits in the first byte gives the number of bytes used, e.g. 110x xxxx means a two-byte encoding; continuation bytes always have the form 10xx xxxx (the x bits carry the bits of the code point)

The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
The code points in the range U+10000 - U+10FFFF can be encoded with four bytes.

A detailed explanation is on Wikipedia
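As a quick check of the byte counts per range, here is a minimal sketch that asks the JDK's own UTF-8 encoder how many bytes a code point takes. The class name Utf8Lengths and the helper encodedLength are made up for this illustration; surrogate code points (U+D800 - U+DFFF) are deliberately not used as input.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes the JDK's UTF-8 encoder
    // produces for one (non-surrogate) code point.
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample from each range, including the BOM character U+FEFF.
        int[] samples = {0x0041, 0x00E9, 0xFEFF, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encodedLength(cp));
        }
    }
}
```

U+FEFF should be reported with three bytes, which matches the range table above.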

U+FEFF lies in the range U+0800 - U+FFFF, so for the BOM we need three bytes.

hex    FE       FF
binary 11111110 11111111

Encode the bits in UTF-8:

pattern for three byte encoding 1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point           1111    11 1011    11 1111
result                          1110 1111  1011 1011  1011 1111
in hex                          EF         BB         BF

EF BB BF should look familiar by now. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.
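The bit shuffling above can be reproduced by hand. The following is only a sketch of the three-byte case; the class Utf8ThreeByte and its encode method are hypothetical names, and the method does not check for surrogates.

```java
public class Utf8ThreeByte {
    // Encodes a code point in U+0800..U+FFFF with the pattern
    // 1110xxxx 10xxxxxx 10xxxxxx. Surrogates are not rejected here.
    static byte[] encode(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        for (byte b : encode(0xFEFF)) {
            System.out.printf("%02X ", b); // prints EF BB BF
        }
        System.out.println();
    }
}
```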

As the byte order mark has no meaning for UTF-8, Java does not write one when encoding to UTF-8, and its UTF-8 decoder does not strip one either.

Encoding the BOM character as UTF-8:

jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 }  // EF BB BF

Hence when the file is read the byte sequence gets decoded to \uFEFF.
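To see the decoded character directly, here is a small sketch that reads the question's bytes through an InputStreamReader and inspects the result. The class DecodeBom and the helper decodeUtf8 are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodeBom {
    // Hypothetical helper: decodes all bytes as UTF-8 through a Reader.
    static String decodeUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = r.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61, 0x61, 0x61};
        String s = decodeUtf8(b);
        System.out.println(s.length());                       // 4: BOM char + "aaa"
        System.out.println(Integer.toHexString(s.charAt(0))); // feff
    }
}
```

The three BOM bytes become the single char \uFEFF, which is why a plain String#replace can find it.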

When encoding to e.g. UTF-16, the BOM is added:

jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE

[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf

Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.


Chr*_*son 5

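InputStreamReader decodes the UTF-8 encoded byte sequence (b) into UTF-16BE, converting the UTF-8 BOM into the UTF-16BE BOM (\uFEFF) in the process. UTF-16BE is chosen as the target encoding because Charset defaults to this behavior: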

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

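The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.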
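The documented behavior can be observed with a short sketch like the one below. The class Utf16Bom and the helper hex are invented for this illustration; the charsets themselves are the standard ones from StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bom {
    // Hypothetical helper: formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Encoding: only the plain UTF-16 charset writes a BOM.
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16LE))); // 20 00

        // Decoding: UTF-16 consumes the BOM, UTF-16BE keeps it as U+FEFF.
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x20};
        System.out.println(new String(withBom, StandardCharsets.UTF_16).length());   // 1
        System.out.println(new String(withBom, StandardCharsets.UTF_16BE).length()); // 2
    }
}
```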

See JLS 3.1 for why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 on your system.
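Because the result of the no-argument getBytes() depends on the platform, it may help to pass the charset explicitly. This is a sketch only; the class ExplicitCharset and the method stripAndEncode are hypothetical names, not something from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: removes every U+FEFF (as the question's one-liner
    // does) and encodes the result with the given charset.
    static byte[] stripAndEncode(String s, Charset cs) {
        return s.replace("\uFEFF", "").getBytes(cs);
    }

    public static void main(String[] args) {
        // The charset that the no-argument getBytes() would use; platform dependent.
        System.out.println(Charset.defaultCharset());

        String decoded = "\uFEFFaaa";
        // With an explicit charset the byte count no longer depends on the platform.
        System.out.println(stripAndEncode(decoded, StandardCharsets.UTF_8).length);  // 3
        System.out.println(stripAndEncode(decoded, StandardCharsets.UTF_16).length); // 8 (BOM + 3 * 2 bytes)
    }
}
```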

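Summary

The sequence EF BB BF (UTF-8 BOM) is converted to FE FF (UTF-16BE BOM) when the byte sequence is decoded into a String using InputStreamReader, because the encoding of java.lang.String with the default Charset is UTF-16BE in the presence of a BOM. After replacing the UTF-16BE BOM and calling String#getBytes(), the string is encoded as UTF-8 (your platform's default charset), and you see the original byte sequence without the BOM.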
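As a variation on the one-liner from the question, the whole round trip can be sketched with a strip that only touches a leading \uFEFF. Note that replace("\uFEFF", "") removes every occurrence, including any in the middle of the text. The class StripLeadingBom and its methods are invented for this example and are not taken from the question or the answers.

```java
import java.nio.charset.StandardCharsets;

public class StripLeadingBom {
    // Hypothetical helper: drops U+FEFF only if it is the first char.
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Decode UTF-8 bytes, strip a leading BOM, encode back to UTF-8.
    static byte[] roundTrip(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return stripLeadingBom(decoded).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61, 0x61, 0x61};
        for (byte b : roundTrip(in)) {
            System.out.printf("%02X ", b); // prints 61 61 61
        }
        System.out.println();
    }
}
```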