Ami*_*adi 11 performance jit jvm-hotspot bounds-check-elimination protobuf-java
Looking at UTF-8 decoding performance, I noticed that protobuf's `UnsafeProcessor::decodeUtf8` outperforms `String(byte[] bytes, int offset, int length, Charset charset)` for the following non-ASCII string: `"Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen"`.

I tried to figure out why, so I copied the relevant code from `String` and replaced the array accesses with unsafe array accesses, the same way `UnsafeProcessor::decodeUtf8` does it.

Here are the JMH benchmark results:

    Benchmark                       Mode  Cnt    Score   Error  Units
    StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
    StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op

I assume the difference is due to a missing bounds-check elimination, which I would have expected to kick in, especially since there is an explicit bounds check, `checkBoundsOffCount(offset, length, bytes.length)`, in `String(byte[] bytes, int offset, int length, Charset charset)`.

Is the problem really a missing bounds-check elimination?
Here is the code I benchmarked using OpenJDK 17 and JMH. Note that this is only part of the `String(byte[] bytes, int offset, int length, Charset charset)` constructor code, and it works correctly only for this specific Danish string.

The static methods are copied from `String`.

Look for the `// the unsafe version:` comments, which mark the places where I replaced a safe access with an unsafe one.

    private static byte[] safeDecode(byte[] bytes, int offset, int length) {
        checkBoundsOffCount(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        while (offset < sl) {
            int b1 = bytes[offset];
            // the unsafe version:
            // int b1 = UnsafeUtil.getByte(bytes, offset);
            if (b1 >= 0) {
                dst[dp++] = (byte)b1;
                offset++;
                continue;
            }
            if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                    offset + 1 < sl) {
                // the unsafe version:
                // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte)decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            // anything not a latin1, including the repl
            // we have to go with the utf16
            break;
        }
        if (offset == sl) {
            if (dp != dst.length) {
                dst = Arrays.copyOf(dst, dp);
            }
            return dst;
        }

        return dst;
    }

Apparently, if I change the `while` loop condition from `offset < sl` to `0 <= offset && offset < sl`, I get similar performance in both versions:

    Benchmark                       Mode  Cnt    Score    Error  Units
    StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
    StringBenchmark.unsafeDecoding  avgt   10  102.774 ±  3.893  ns/op

HotSpot developers filed this issue as https://bugs.openjdk.java.net/browse/JDK-8278518.

Optimizing this code ended up making the decoding of the above Latin1 string 2.5 times faster.
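
For readers who want to experiment with the loop shape outside the JDK, here is a minimal standalone sketch of the Latin1 fast path with the workaround condition from the question (`0 <= offset && offset < sl`). The class name `Latin1FastPathSketch`, the `decodeLatin1` method, and the local `checkBoundsOffCount` stand-in are assumptions made for illustration; they are not the JDK's code, and nothing here says how the JIT will compile it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1FastPathSketch {

    // Stand-in for the private String.checkBoundsOffCount; written from scratch here.
    static void checkBoundsOffCount(int offset, int count, int length) {
        if (offset < 0 || count < 0 || offset > length - count) {
            throw new StringIndexOutOfBoundsException(
                    "offset " + offset + ", count " + count + ", length " + length);
        }
    }

    /**
     * Decodes UTF-8 input whose characters all fit into Latin-1 (U+0000..U+00FF).
     * Returns null as soon as anything else shows up (the code in the question
     * breaks out of the loop at that point to go with UTF-16).
     */
    static byte[] decodeLatin1(byte[] bytes, int offset, int length) {
        checkBoundsOffCount(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        // Workaround from the question: the lower bound is spelled out explicitly.
        while (0 <= offset && offset < sl) {
            int b1 = bytes[offset];
            if (b1 >= 0) {                          // single-byte (ASCII)
                dst[dp++] = (byte) b1;
                offset++;
                continue;
            }
            if ((b1 == (byte) 0xc2 || b1 == (byte) 0xc3) && offset + 1 < sl) {
                int b2 = bytes[offset + 1];
                if ((b2 & 0xc0) == 0x80) {          // continuation byte 10xxxxxx
                    dst[dp++] = (byte) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
                    offset += 2;
                    continue;
                }
            }
            return null;                            // not representable in Latin-1
        }
        return dp == dst.length ? dst : Arrays.copyOf(dst, dp);
    }

    public static void main(String[] args) {
        String s = "Quizdeltagerne spiste jordb\u00e6r med fl\u00f8de, mens cirkusklovnen";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = decodeLatin1(utf8, 0, utf8.length);
        // prints true
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(s));
    }
}
```

Wrapping `decodeLatin1` in a JMH benchmark, once with and once without the `0 <= offset` part, is one way to check whether a given JDK build still shows the difference.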

The resulting C2 optimization (JDK-8278518) closes an incredible gap of more than 7x between `commonBranchFirst` and `commonBranchSecond` in the benchmark below, and will land in Java 19.

    Benchmark                         Mode  Cnt     Score    Error  Units
    LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
    LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class LoopBenchmark {

        private final boolean[] mostlyTrue = new boolean[1000];

        @Setup
        public void setup() {
            // every 100th element is false, all others are true
            for (int i = 0; i < mostlyTrue.length; i++) {
                mostlyTrue[i] = i % 100 > 0;
            }
        }

        @Benchmark
        public int commonBranchFirst() {
            int i = 0;
            while (i < mostlyTrue.length) {
                if (mostlyTrue[i]) {
                    i++;
                } else {
                    i += 2;
                }
            }
            return i;
        }

        @Benchmark
        public int commonBranchSecond() {
            int i = 0;
            while (i < mostlyTrue.length) {
                if (!mostlyTrue[i]) {
                    i += 2;
                } else {
                    i++;
                }
            }
            return i;
        }
    }
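
The two benchmark methods differ only in which branch is written first. As a quick sanity check outside JMH, the sketch below runs both loop shapes over the same data and shows that they compute the same result, so the measured gap comes from how the loops are compiled and not from the work they do. The class name `BranchOrderSketch` and the helper `mostlyTrue()` are hypothetical names used only for this illustration.

```java
final class BranchOrderSketch {

    // Same data as LoopBenchmark.setup(): every 100th element is false.
    static boolean[] mostlyTrue() {
        boolean[] flags = new boolean[1000];
        for (int i = 0; i < flags.length; i++) {
            flags[i] = i % 100 > 0;
        }
        return flags;
    }

    // Common case (flag is true) is tested first.
    static int commonBranchFirst(boolean[] flags) {
        int i = 0;
        while (i < flags.length) {
            if (flags[i]) {
                i++;
            } else {
                i += 2;
            }
        }
        return i;
    }

    // Rare case (flag is false) is tested first.
    static int commonBranchSecond(boolean[] flags) {
        int i = 0;
        while (i < flags.length) {
            if (!flags[i]) {
                i += 2;
            } else {
                i++;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        boolean[] flags = mostlyTrue();
        System.out.println(commonBranchFirst(flags));   // 1000
        System.out.println(commonBranchSecond(flags));  // 1000
    }
}
```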

To measure the branch you are interested in, and in particular the case where the `while` loop becomes hot, I used the following benchmark:

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class StringConstructorBenchmark {
        private byte[] array;

        @Setup
        public void setup() {
            String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
            array = str.getBytes(StandardCharsets.UTF_8);
        }

        @Benchmark
        public String newString() {
            return new String(array, 0, array.length, StandardCharsets.UTF_8);
        }
    }

Indeed, with the constructor modified, it does bring a significant improvement:

    // baseline
    Benchmark                             Mode  Cnt    Score   Error  Units
    StringConstructorBenchmark.newString  avgt   50  173.092 ± 3.048  ns/op

    // patched
    Benchmark                             Mode  Cnt    Score   Error  Units
    StringConstructorBenchmark.newString  avgt   50  126.908 ± 2.355  ns/op

This is likely a HotSpot issue: for some reason the optimizing compiler failed to eliminate the array bounds check inside the `while` loop. My guess is that the reason is that `offset` is modified within the loop.

I also looked at the code via `LinuxPerfAsmProfiler`. Here is the link for the baseline: https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and here is the one for the patched constructor: https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7

What should one pay attention to? Let's find the code corresponding to `int b1 = bytes[offset];` (line 538). In the baseline we have this:

      3.62%  ││  │  0x00007fed70eb4c1c: mov    %ebx,%ecx
      2.29%  ││  │  0x00007fed70eb4c1e: mov    %edx,%r9d
      2.22%  ││  │  0x00007fed70eb4c21: mov    (%rsp),%r8  ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
             ││  │                                         ; - java.lang.String::<init>@107 (line 537)
      2.32%  ↘│  │  0x00007fed70eb4c25: cmp    %r13d,%ecx
              │  │  0x00007fed70eb4c28: jge    0x00007fed70eb5388  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
              │  │                                         ; - java.lang.String::<init>@110 (line 537)
      3.05%   │  │  0x00007fed70eb4c2e: cmp    0x8(%rsp),%ecx
              │  │  0x00007fed70eb4c32: jae    0x00007fed70eb5319
      2.38%   │  │  0x00007fed70eb4c38: mov    %r8,(%rsp)
      2.64%   │  │  0x00007fed70eb4c3c: movslq %ecx,%r8
      2.46%   │  │  0x00007fed70eb4c3f: mov    %rax,%rbx
      3.44%   │  │  0x00007fed70eb4c42: sub    %r8,%rbx
      2.62%   │  │  0x00007fed70eb4c45: add    $0x1,%rbx
      2.64%   │  │  0x00007fed70eb4c49: and    $0xfffffffffffffffe,%rbx
      2.30%   │  │  0x00007fed70eb4c4d: mov    %ebx,%r8d
      3.08%   │  │  0x00007fed70eb4c50: add    %ecx,%r8d
      2.55%   │  │  0x00007fed70eb4c53: movslq %r8d,%r8
      2.45%   │  │  0x00007fed70eb4c56: add    $0xfffffffffffffffe,%r8
      2.13%   │  │  0x00007fed70eb4c5a: cmp    (%rsp),%r8
              │  │  0x00007fed70eb4c5e: jae    0x00007fed70eb5319
      3.36%   │  │  0x00007fed70eb4c64: mov    %ecx,%edi  ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
              │  │                                         ; - java.lang.String::<init>@113 (line 538)
      2.86%   │ ↗│  0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
              │ ││                                         ; - java.lang.String::<init>@115 (line 538)
      2.48%   │ ││  0x00007fed70eb4c6c: mov    %r9d,%edx
      2.26%   │ ││  0x00007fed70eb4c6f: inc    %edx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
              │ ││                                         ; - java.lang.String::<init>@127 (line 540)
      3.28%   │ ││  0x00007fed70eb4c71: mov    %edi,%ebx
      2.44%   │ ││  0x00007fed70eb4c73: inc    %ebx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
              │ ││                                         ; - java.lang.String::<init>@134 (line 541)
      2.35%   │ ││  0x00007fed70eb4c75: test   %r8d,%r8d
              ╰ ││  0x00007fed70eb4c78: jge    0x00007fed70eb4c04  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                ││                                         ; - java.lang.String::<init>@120 (line 539)

In the patched code the corresponding part is:

     17.28%  ││  0x00007f6b88eb6061: mov    %edx,%r10d  ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
             ││                                         ; - java.lang.String::<init>@107 (line 537)
      0.11%  ↘│  0x00007f6b88eb6064: test   %r10d,%r10d
              │  0x00007f6b88eb6067: jl     0x00007f6b88eb669c  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@108 (line 537)
      0.39%   │  0x00007f6b88eb606d: cmp    %r13d,%r10d
              │  0x00007f6b88eb6070: jge    0x00007f6b88eb66d0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@114 (line 537)
      0.66%   │  0x00007f6b88eb6076: mov    %ebx,%r9d
     13.70%   │  0x00007f6b88eb6079: cmp    0x8(%rsp),%r10d
      0.01%   │  0x00007f6b88eb607e: jae    0x00007f6b88eb6671
      0.14%   │  0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi  ;*baload {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@119 (line 538)
      0.37%   │  0x00007f6b88eb608a: mov    %r9d,%ebx
      0.99%   │  0x00007f6b88eb608d: inc    %ebx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@131 (line 540)
     12.88%   │  0x00007f6b88eb608f: movslq %r9d,%rsi  ;*bastore {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@196 (line 548)
      0.17%   │  0x00007f6b88eb6092: mov    %r10d,%edx
      0.39%   │  0x00007f6b88eb6095: inc    %edx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@138 (line 541)
      0.96%   │  0x00007f6b88eb6097: test   %edi,%edi
      0.02%   │  0x00007f6b88eb6099: jl     0x00007f6b88eb60dc  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
              │                                         ; - java.lang.String::<init>@124 (line 539)

In the baseline, between the `if_icmpge` and `aload_1` bytecode instructions, there is a bounds check, but in the patched code there is none.

So your original assumption is right: it is about a missing bounds-check elimination.

UPD: I have to correct my answer. It turns out the bounds check is still there:

     13.70%   │  0x00007f6b88eb6079: cmp    0x8(%rsp),%r10d
      0.01%   │  0x00007f6b88eb607e: jae    0x00007f6b88eb6671

The code I pointed out is something the compiler introduces, but it does nothing. The issue itself is still related to the bounds check, since declaring it explicitly works around the problem.
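
As a reading aid for the `cmp`/`jae` pair quoted above: `jae` is the x86 "jump if above or equal" instruction, which tests an unsigned comparison. One unsigned compare of the index against the array length therefore rejects both negative indices and indices past the end. The sketch below expresses that in Java terms. The class and method names are hypothetical, and reading `0x8(%rsp)` as the spilled array length is an assumption rather than something the listing states.

```java
final class UnsignedBoundsCheckSketch {

    // The check as one would write it in Java source: two signed comparisons.
    static boolean inBoundsSigned(int index, int length) {
        return 0 <= index && index < length;
    }

    // The same check as a single unsigned comparison, which is what a
    // cmp followed by jae amounts to. A negative index becomes a huge
    // unsigned value, so it fails the test just like an index >= length.
    // Valid only for length >= 0, which always holds for array lengths.
    static boolean inBoundsUnsigned(int index, int length) {
        return Integer.compareUnsigned(index, length) < 0;
    }

    public static void main(String[] args) {
        int length = 10;
        for (int index : new int[] {-1, 0, 9, 10}) {
            System.out.println(index + ": "
                    + inBoundsSigned(index, length) + " "
                    + inBoundsUnsigned(index, length));
        }
    }
}
```

Both methods agree for every index as long as the length is non-negative, which is why a compiled bounds check normally appears as one compare and one branch.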
\n