Missing bounds-check elimination in the String constructor?

Ami*_*adi 11 performance jit jvm-hotspot bounds-check-elimination protobuf-java

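While looking at UTF-8 decoding performance, I noticed that protobuf's UnsafeProcessor::decodeUtf8 performs better than String(byte[] bytes, int offset, int length, Charset charset) for the following non-ASCII string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen"

I tried to figure out why, so I copied the relevant code from String and replaced the array accesses with unsafe array accesses, the same as in UnsafeProcessor::decodeUtf8.
Here are the JMH benchmark results: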

    Benchmark                       Mode  Cnt    Score   Error  Units
    StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
    StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op

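I believe the difference comes from missing bounds-check elimination, especially since String(byte[] bytes, int offset, int length, Charset charset) already performs an explicit bounds check via checkBoundsOffCount(offset, length, bytes.length).

Is the issue really missing bounds-check elimination?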
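
To make the expectation concrete, here is a minimal sketch of what the explicit check guarantees. The checkBoundsOffCount body below is written from memory to mirror the JDK 17 behavior and is an assumption, not a verbatim copy; BoundsCheckSketch and allIndicesValid are hypothetical names used only for illustration.

```java
class BoundsCheckSketch {
    // Assumed shape of the check: rejects a negative offset or count,
    // and any range that would run past the end of the array.
    static void checkBoundsOffCount(int offset, int count, int length) {
        if (offset < 0 || count < 0 || offset > length - count) {
            throw new StringIndexOutOfBoundsException(
                "offset " + offset + ", count " + count + ", length " + length);
        }
    }

    // Hypothetical helper: once the check has passed, every index in
    // [offset, offset + count) is a valid array index, so a per-iteration
    // bounds check inside the decoding loop is redundant in principle.
    static boolean allIndicesValid(byte[] bytes, int offset, int count) {
        checkBoundsOffCount(offset, count, bytes.length);
        int sl = offset + count;
        for (int i = offset; i < sl; i++) {
            if (i < 0 || i >= bytes.length) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        System.out.println(allIndicesValid(data, 4, 12));
        try {
            checkBoundsOffCount(4, 13, data.length);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether the JIT actually exploits this invariant is exactly what the question is about; the sketch only shows that the invariant holds.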

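Here is the code I benchmarked using OpenJDK 17 and JMH. Note that this is only part of the String(byte[] bytes, int offset, int length, Charset charset) constructor code, and it works correctly only for this specific string.
The static methods were copied from String.
Look for the // the unsafe version: comments, which mark the places where I replaced safe access with unsafe access.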

    private static byte[] safeDecode(byte[] bytes, int offset, int length) {
        checkBoundsOffCount(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        while (offset < sl) {
            int b1 = bytes[offset];
            // the unsafe version:
            // int b1 = UnsafeUtil.getByte(bytes, offset);
            if (b1 >= 0) {
                dst[dp++] = (byte)b1;
                offset++;
                continue;
            }
            if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                    offset + 1 < sl) {
                // the unsafe version:
                // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte)decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            // anything not a latin1, including the repl
            // we have to go with the utf16
            break;
        }
        if (offset == sl) {
            if (dp != dst.length) {
                dst = Arrays.copyOf(dst, dp);
            }
            return dst;
        }

        return dst;
    }

Follow-up

Apparently, if I change the while loop condition from offset < sl to 0 <= offset && offset < sl, I get similar performance in both versions:

    Benchmark                       Mode  Cnt    Score    Error  Units
    StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
    StringBenchmark.unsafeDecoding  avgt   10  102.774 ± 3.893  ns/op
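
For reference, the loop with the widened condition can be exercised on its own. The sketch below is a trimmed variant of safeDecode, not the exact benchmarked code: isNotContinuation and decode2 are reproduced from memory of the JDK sources and should be treated as assumptions, checkBoundsOffCount is swapped for Objects.checkFromIndexSize, non-Latin-1 input returns null instead of breaking out of the loop, and the names PatchedDecodeSketch and decodeLatin1 are hypothetical. It checks correctness only and says nothing about speed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Objects;

class PatchedDecodeSketch {
    // True if b is not a UTF-8 continuation byte (bit pattern 10xxxxxx).
    private static boolean isNotContinuation(int b) {
        return (b & 0xc0) != 0x80;
    }

    // Combines the two bytes of a two-byte UTF-8 sequence into one char.
    private static char decode2(int b1, int b2) {
        return (char) (((b1 << 6) ^ b2) ^ (((byte) 0xC0 << 6) ^ ((byte) 0x80)));
    }

    // Decodes UTF-8 into Latin-1 bytes; returns null on the first
    // sequence that cannot be represented in Latin-1.
    static byte[] decodeLatin1(byte[] bytes, int offset, int length) {
        Objects.checkFromIndexSize(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        // The only change relative to the original loop: an explicit lower bound.
        while (0 <= offset && offset < sl) {
            int b1 = bytes[offset];
            if (b1 >= 0) {
                dst[dp++] = (byte) b1;
                offset++;
                continue;
            }
            if ((b1 == (byte) 0xc2 || b1 == (byte) 0xc3) && offset + 1 < sl) {
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte) decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            return null; // simplification: the original code breaks out here
        }
        return dp == dst.length ? dst : Arrays.copyOf(dst, dp);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        String s = "Quizdeltagerne spiste jordb\u00e6r med fl\u00d8de, mens cirkusklovnen";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = decodeLatin1(utf8, 0, utf8.length);
        System.out.println(Arrays.equals(latin1, s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The extra 0 <= offset test is logically redundant after the up-front range check, which is why its effect on the generated code is surprising.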

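Conclusion

HotSpot developers filed this issue as https://bugs.openjdk.java.net/browse/JDK-8278518

Optimizing this code ended up speeding up decoding of the Latin1 string above by 2.5x.

This C2 optimization closes the unbelievable more-than-7x gap between commonBranchFirst and commonBranchSecond in the benchmark below, and will land in Java 19.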

    Benchmark                         Mode  Cnt     Score    Error  Units
    LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
    LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class LoopBenchmark {

      private final boolean[] mostlyTrue = new boolean[1000];

      @Setup
      public void setup() {
        // false at every multiple of 100, true everywhere else
        for (int i = 0; i < mostlyTrue.length; i++) {
          mostlyTrue[i] = i % 100 > 0;
        }
      }

      @Benchmark
      public int commonBranchFirst() {
        int i = 0;
        while (i < mostlyTrue.length) {
          if (mostlyTrue[i]) {
            i++;
          } else {
            i += 2;
          }
        }
        return i;
      }

      @Benchmark
      public int commonBranchSecond() {
        int i = 0;
        while (i < mostlyTrue.length) {
          if (!mostlyTrue[i]) {
            i += 2;
          } else {
            i++;
          }
        }
        return i;
      }
    }
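
Since the two benchmark methods differ only in which branch is written first, they should do the same work and return the same value. The plain-Java sketch below (no JMH; the class name LoopEquivalenceSketch is hypothetical) checks that, which supports reading the gap as a code-generation effect rather than a difference in the work performed.

```java
class LoopEquivalenceSketch {
    // Same data as the benchmark: false at every multiple of 100, true elsewhere.
    static boolean[] mostlyTrue() {
        boolean[] a = new boolean[1000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i % 100 > 0;
        }
        return a;
    }

    // Common case (true) handled in the first branch.
    static int commonBranchFirst(boolean[] a) {
        int i = 0;
        while (i < a.length) {
            if (a[i]) {
                i++;
            } else {
                i += 2;
            }
        }
        return i;
    }

    // Common case (true) handled in the second branch.
    static int commonBranchSecond(boolean[] a) {
        int i = 0;
        while (i < a.length) {
            if (!a[i]) {
                i += 2;
            } else {
                i++;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        boolean[] a = mostlyTrue();
        // prints "1000 1000"
        System.out.println(commonBranchFirst(a) + " " + commonBranchSecond(a));
    }
}
```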

Ser*_*nov 3

To measure the branch you are interested in, and in particular the case where the while loop becomes hot, I used the following benchmark:

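    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class StringConstructorBenchmark {
      private byte[] array;

      @Setup
      public void setup() {
        String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
        array = str.getBytes(StandardCharsets.UTF_8);
      }

      @Benchmark
      public String newString()  {
          return new String(array, 0, array.length, StandardCharsets.UTF_8);
      }
    }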

And indeed, with the modified constructor there is a significant improvement:

    //baseline
    Benchmark                             Mode  Cnt    Score   Error  Units
    StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op

    //patched
    Benchmark                             Mode  Cnt    Score   Error  Units
    StringConstructorBenchmark.newString  avgt   50  126,908 ± 2,355  ns/op

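This looks like a HotSpot issue: for some reason the optimizing compiler fails to eliminate the array bounds check inside the while loop. My guess is that this happens because offset is modified inside the loop.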


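I also looked at the generated code with LinuxPerfAsmProfiler. Here is the link for the baseline: https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and this one is for the patched constructor: https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7

What should one pay attention to? Let's find the code corresponding to int b1 = bytes[offset]; (line 538). In the baseline we have this: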

      3.62%           ││            │  0x00007fed70eb4c1c:   mov    %ebx,%ecx
      2.29%           ││            │  0x00007fed70eb4c1e:   mov    %edx,%r9d
      2.22%           ││            │  0x00007fed70eb4c21:   mov    (%rsp),%r8                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                      ││            │                                                            ; - java.lang.String::<init>@107 (line 537)
      2.32%           ↘│            │  0x00007fed70eb4c25:   cmp    %r13d,%ecx
                       │            │  0x00007fed70eb4c28:   jge    0x00007fed70eb5388           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                       │            │                                                            ; - java.lang.String::<init>@110 (line 537)
      3.05%            │            │  0x00007fed70eb4c2e:   cmp    0x8(%rsp),%ecx
                       │            │  0x00007fed70eb4c32:   jae    0x00007fed70eb5319
      2.38%            │            │  0x00007fed70eb4c38:   mov    %r8,(%rsp)
      2.64%            │            │  0x00007fed70eb4c3c:   movslq %ecx,%r8
      2.46%            │            │  0x00007fed70eb4c3f:   mov    %rax,%rbx
      3.44%            │            │  0x00007fed70eb4c42:   sub    %r8,%rbx
      2.62%            │            │  0x00007fed70eb4c45:   add    $0x1,%rbx
      2.64%            │            │  0x00007fed70eb4c49:   and    $0xfffffffffffffffe,%rbx
      2.30%            │            │  0x00007fed70eb4c4d:   mov    %ebx,%r8d
      3.08%            │            │  0x00007fed70eb4c50:   add    %ecx,%r8d
      2.55%            │            │  0x00007fed70eb4c53:   movslq %r8d,%r8
      2.45%            │            │  0x00007fed70eb4c56:   add    $0xfffffffffffffffe,%r8
      2.13%            │            │  0x00007fed70eb4c5a:   cmp    (%rsp),%r8
                       │            │  0x00007fed70eb4c5e:   jae    0x00007fed70eb5319
      3.36%            │            │  0x00007fed70eb4c64:   mov    %ecx,%edi                    ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                       │            │                                                            ; - java.lang.String::<init>@113 (line 538)
      2.86%            │           ↗│  0x00007fed70eb4c66:   movsbl 0x10(%r14,%rdi,1),%r8d       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                       │           ││                                                            ; - java.lang.String::<init>@115 (line 538)
      2.48%            │           ││  0x00007fed70eb4c6c:   mov    %r9d,%edx
      2.26%            │           ││  0x00007fed70eb4c6f:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                       │           ││                                                            ; - java.lang.String::<init>@127 (line 540)
      3.28%            │           ││  0x00007fed70eb4c71:   mov    %edi,%ebx
      2.44%            │           ││  0x00007fed70eb4c73:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                       │           ││                                                            ; - java.lang.String::<init>@134 (line 541)
      2.35%            │           ││  0x00007fed70eb4c75:   test   %r8d,%r8d
                       ╰           ││  0x00007fed70eb4c78:   jge    0x00007fed70eb4c04           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                                   ││                                                            ; - java.lang.String::<init>@120 (line 539)

In the patched code the corresponding part is:

     17.28%           ││  0x00007f6b88eb6061:   mov    %edx,%r10d                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                      ││                                                            ; - java.lang.String::<init>@107 (line 537)
      0.11%           ↘│  0x00007f6b88eb6064:   test   %r10d,%r10d
                       │  0x00007f6b88eb6067:   jl     0x00007f6b88eb669c           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@108 (line 537)
      0.39%            │  0x00007f6b88eb606d:   cmp    %r13d,%r10d
                       │  0x00007f6b88eb6070:   jge    0x00007f6b88eb66d0           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@114 (line 537)
      0.66%            │  0x00007f6b88eb6076:   mov    %ebx,%r9d
     13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
      0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671
      0.14%            │  0x00007f6b88eb6084:   movsbl 0x10(%r14,%r10,1),%edi       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@119 (line 538)
      0.37%            │  0x00007f6b88eb608a:   mov    %r9d,%ebx
      0.99%            │  0x00007f6b88eb608d:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@131 (line 540)
     12.88%            │  0x00007f6b88eb608f:   movslq %r9d,%rsi                    ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@196 (line 548)
      0.17%            │  0x00007f6b88eb6092:   mov    %r10d,%edx
      0.39%            │  0x00007f6b88eb6095:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@138 (line 541)
      0.96%            │  0x00007f6b88eb6097:   test   %edi,%edi
      0.02%            │  0x00007f6b88eb6099:   jl     0x00007f6b88eb60dc           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                       │                                                            ; - java.lang.String::<init>@124 (line 539)

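In the baseline there is a bounds check between the if_icmpge and aload_1 bytecode instructions, while in the patched code there is none.

So your original assumption is correct: this is about missing bounds-check elimination.

UPD: I have to correct my answer. It turns out the bounds check is still there: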

    13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
     0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671

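The code I pointed out earlier is something the compiler introduces, but it does nothing. The issue itself is still related to the bounds check, since declaring the check explicitly works around the problem ad hoc.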
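
As background for reading a cmp/jae pair like the one shown in the update: jae is an unsigned "above or equal" jump, and a single unsigned comparison of the index against a non-negative length covers both index < 0 and index >= length, because a negative int reinterpreted as unsigned is larger than any valid length. The Java sketch below illustrates that equivalence; the class and method names are hypothetical, and treating 0x8(%rsp) as the spilled array length is an assumption about the listing.

```java
class UnsignedBoundsSketch {
    // The two-sided check as it would be written in Java source.
    static boolean outOfBoundsTwoSided(int index, int length) {
        return index < 0 || index >= length;
    }

    // The single unsigned comparison; assumes length >= 0, as for array lengths.
    static boolean outOfBoundsUnsigned(int index, int length) {
        return Integer.compareUnsigned(index, length) >= 0;
    }

    public static void main(String[] args) {
        int length = 64;
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 63, 64, 65, Integer.MAX_VALUE};
        for (int index : samples) {
            System.out.println(index + ": two-sided=" + outOfBoundsTwoSided(index, length)
                + ", unsigned=" + outOfBoundsUnsigned(index, length));
        }
    }
}
```

Both methods agree on every sample, which is why one cmp followed by jae is enough to implement a full array bounds check.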