Calculating the UTF-8 length of a Java String without actually encoding it

Tre*_*son 40 java utf-8

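Does anyone know whether the standard Java library (any version) provides a way to calculate the length of a string's binary encoding (UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of: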

"some really long string".getBytes("UTF-8").length

I need to calculate a length prefix for potentially long serialized messages.

McD*_*ell 47

Here is an implementation based on the UTF-8 specification:

public class Utf8LenCounter {
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
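      } else if (Character.isHighSurrogate(ch)) {
        // A surrogate pair is one supplementary code point: 4 bytes in UTF-8.
        count += 4;
        ++i; // skip the low surrogate that is assumed to follow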
      } else {
        count += 3;
      }
    }
    return count;
  }
}

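This implementation does not tolerate malformed strings: every high surrogate is assumed to be followed by its low surrogate.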
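If malformed input has to be handled, one possible variant is sketched below. `LenientUtf8Len` is a hypothetical name, not part of the answer. It assumes an unpaired surrogate should count as one byte, which matches how `String.getBytes` with UTF-8 substitutes `?` for it on current JDKs; if your encoder reports an error or uses a different replacement, adjust that branch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: UTF-8 length that tolerates unpaired surrogates. */
final class LenientUtf8Len {
  static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)
          && i + 1 < len
          && Character.isLowSurrogate(sequence.charAt(i + 1))) {
        count += 4; // well-formed surrogate pair
        i++;        // consume the low surrogate as well
      } else if (Character.isSurrogate(ch)) {
        count += 1; // assumption: a lone surrogate is replaced by '?'
      } else {
        count += 3;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String malformed = "a\uD800b"; // high surrogate with no low surrogate
    System.out.println(length(malformed));
    System.out.println(malformed.getBytes(StandardCharsets.UTF_8).length);
  }
}
```

The two printed values should agree if the one-byte replacement assumption holds for your JDK.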

Here is a JUnit 4 test to verify it:

import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
  @Test public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;
    public boolean hasNext() { return codepoint < MAX; }
    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }
    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
        if (!Character.isDefined(codepoint)) { continue; }
        return codepoint;
      }
      return MAX;
    }
  }
}

Please excuse the compact formatting.

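  • This should work, but it's needlessly complicated: you don't need to support 5- and 6-byte characters (since Unicode doesn't allow, and UTF-16 can't represent, code points that high), and if `Character.isHighSurrogate(ch)`, then you don't actually need to determine the code point: the set of code points that require a surrogate pair in UTF-16 is the same as the set of code points that require four bytes in UTF-8. So, if you don't support invalid surrogate pairs, you can write (3 upvotes)
  • `if (ch <= '\u007F') ++count; else if (ch <= '\u07FF') count += 2; else if (Character.isHighSurrogate(ch)) { count += 4; ++i; } else count += 3;`. But +1 for including a super-comprehensive unit test. :-) (6 upvotes)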
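The claim in the comments that "needs a surrogate pair in UTF-16" and "needs four bytes in UTF-8" describe the same set of code points can be checked by brute force. The sketch below is only an illustration; `SurrogateVsFourBytes` and `setsCoincide` are hypothetical names, and the check relies on the JDK's own UTF-8 encoder as the reference.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical check: surrogate-pair code points vs. four-byte UTF-8 code points. */
final class SurrogateVsFourBytes {
  /** Returns true if both properties agree for every code point outside the surrogate range. */
  static boolean setsCoincide() {
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      // Surrogate code points are not encodable on their own, so skip them.
      if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
        continue;
      }
      boolean needsPair = Character.charCount(cp) == 2;
      int utf8Len =
          new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
      if (needsPair != (utf8Len == 4)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(setsCoincide());
  }
}
```

This is why the counter can add 4 as soon as it sees a high surrogate, without decoding the pair into a code point first.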

Aar*_*man 13

Use Guava's Utf8:

Utf8.encodedLength("some really long string")