Can anyone explain the significant performance difference in char-by-char iteration over j.l.String?

Ser*_*nov 6 java string performance benchmarking

I tried two ways of iterating char by char over java.lang.String and found the results confusing. This benchmark summarizes them:

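import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// RandomStringGenerator is a helper class that is not shown in the question.
@BenchmarkMode(Mode.AverageTime)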
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i < length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({"true", "false"})
    private boolean latin;

    @Param({"5", "10", "50", "100"})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? "abcdefghijklmnopqrstuvwxyz"        // English
        : "абвгдежзийклмнопрстуфхцчшщъыьэюя"; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }
}

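Intuitively, the toCharArray() approach looks less efficient, because it allocates a copy of the underlying char[] on Java 8, and on Java 9 and newer it additionally has to decode the underlying byte[] into a char[]. In practice it is the other way around: toCharArray() runs much faster: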

Java 8

                                 (latin)  (length)  Mode      Score     Error   Units
charAt                              true         5  avgt     21.051 ±   0.796   ns/op
charAt                              true        10  avgt     44.002 ±   2.324   ns/op
charAt                              true        50  avgt    221.068 ±   7.422   ns/op
charAt                              true       100  avgt    410.162 ±  13.441   ns/op

toCharArray                         true         5  avgt     16.819 ±   0.662   ns/op
toCharArray                         true        10  avgt     28.364 ±   0.663   ns/op
toCharArray                         true        50  avgt    110.910 ±   1.144   ns/op
toCharArray                         true       100  avgt    205.694 ±   1.669   ns/op

charAt:·gc.alloc.rate.norm          true         5  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          true        10  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          true        50  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          true       100  avgt     ? 10??              B/op

toCharArray:·gc.alloc.rate.norm     true         5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true       100  avgt    216.000 ±   0.001    B/op

charAt                              false        5  avgt     20.372 ±   0.406   ns/op
charAt                              false       10  avgt     39.962 ±   0.911   ns/op
charAt                              false       50  avgt    201.337 ±   3.752   ns/op
charAt                              false      100  avgt    410.530 ±  17.931   ns/op

toCharArray                         false        5  avgt     15.767 ±   0.606   ns/op
toCharArray                         false       10  avgt     26.258 ±   0.345   ns/op
toCharArray                         false       50  avgt    109.631 ±   1.064   ns/op
toCharArray                         false      100  avgt    205.815 ±   4.716   ns/op

charAt:·gc.alloc.rate.norm          false        5  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          false       10  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          false       50  avgt     ? 10??              B/op
charAt:·gc.alloc.rate.norm          false      100  avgt     ? 10??              B/op

toCharArray:·gc.alloc.rate.norm     false        5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false      100  avgt    216.000 ±   0.001    B/op


Java 11


                                  (latin)  (length)  Mode     Score     Error   Units
charAt                               true         5  avgt    22.035 ±   1.557   ns/op
charAt                               true        10  avgt    41.800 ±   1.572   ns/op
charAt                               true        50  avgt   227.180 ±  18.655   ns/op
charAt                               true       100  avgt   474.719 ±  29.782   ns/op

toCharArray                          true         5  avgt    17.091 ±   0.662   ns/op
toCharArray                          true        10  avgt    26.167 ±   0.220   ns/op
toCharArray                          true        50  avgt   127.876 ±   2.106   ns/op
toCharArray                          true       100  avgt   244.449 ±   9.330   ns/op

charAt:·gc.alloc.rate.norm           true         5  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm           true        10  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm           true        50  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm           true       100  avgt    ? 10??              B/op

toCharArray:·gc.alloc.rate.norm      true         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true       100  avgt   216.000 ±   0.001    B/op

charAt                              false         5  avgt    22.215 ±   2.064   ns/op
charAt                              false        10  avgt    45.606 ±   2.567   ns/op
charAt                              false        50  avgt   204.577 ±  18.302   ns/op
charAt                              false       100  avgt   404.056 ±  10.203   ns/op

toCharArray                         false         5  avgt    17.055 ±   0.556   ns/op
toCharArray                         false        10  avgt    29.254 ±   2.616   ns/op
toCharArray                         false        50  avgt   123.610 ±   5.033   ns/op
toCharArray                         false       100  avgt   226.174 ±   6.396   ns/op

charAt:·gc.alloc.rate.norm          false         5  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm          false        10  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm          false        50  avgt    ? 10??              B/op
charAt:·gc.alloc.rate.norm          false       100  avgt    ? 10??              B/op

toCharArray:·gc.alloc.rate.norm     false         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       100  avgt   216.000 ±   0.001    B/op
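The gc.alloc.rate.norm values reported for toCharArray() are consistent with exactly one char[] being allocated per call. The following is a rough check, not a measurement: it assumes a 64-bit HotSpot JVM with a 16-byte array header and 8-byte object alignment, and the class and method names are made up for illustration.

```java
public class CharArraySize {
    // Assumption: 64-bit HotSpot with compressed class pointers,
    // which gives a 16-byte array header and 8-byte object alignment.
    static final int ARRAY_HEADER_BYTES = 16;
    static final int OBJECT_ALIGNMENT = 8;

    // Estimated heap size of a char[] of the given length.
    static long charArrayBytes(int length) {
        long raw = ARRAY_HEADER_BYTES + 2L * length; // 2 bytes per char
        // Round up to the next multiple of the object alignment.
        return (raw + OBJECT_ALIGNMENT - 1) / OBJECT_ALIGNMENT * OBJECT_ALIGNMENT;
    }

    public static void main(String[] args) {
        for (int length : new int[] {5, 10, 50, 100}) {
            System.out.println(length + " chars -> " + charArrayBytes(length) + " B");
        }
    }
}
```

Under those assumptions the estimates for lengths 5, 10, 50 and 100 come out as 32, 40, 120 and 216 bytes, the same as the B/op column above. So the copy is really being allocated, and it still does not make toCharArray() the slower variant.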

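At first I thought the reason was the same as the one described in Nitsan Wakart's article "The volatile read surprise". However, profiling with perfasm showed that the hottest spots in the code have nothing to do with access to the char[]/byte[] field: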

           ??     0x00007fa638407dd9: jmp    0x00007fa638407e4c
           ??     0x00007fa638407ddb: nopl   0x0(%rax,%rax,1)
  4.96%    ??  ?  0x00007fa638407de0: shl    $0x3,%r11
  0.01%    ??  ?  0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx  ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
           ??  ?                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           ??  ?                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.58%    ?? ??  0x00007fa638407dea: mov    %rsi,0x18(%rsp)
  1.87%    ?? ??  0x00007fa638407def: mov    %r8d,0x14(%rsp)
  4.18%    ?? ??  0x00007fa638407df4: mov    %edi,0x10(%rsp)
  0.04%    ?? ??  0x00007fa638407df8: mov    %rbx,0x8(%rsp)
  1.29%    ?? ??  0x00007fa638407dfd: mov    %r10,(%rsp)
  1.83%    ?? ??  0x00007fa638407e01: mov    %r9,0x70(%rsp)
  4.32%    ?? ??  0x00007fa638407e06: mov    %rax,0x60(%rsp)
  0.05%    ?? ??  0x00007fa638407e0b: mov    %r9,%rsi
  1.27%    ?? ??  0x00007fa638407e0e: nop
  1.88%    ?? ??  0x00007fa638407e0f: callq  0x00007fa630926e00  ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
           ?? ??                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
           ?? ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
           ?? ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
           ?? ??                                                ;   {optimized virtual_call}
  5.71%    ?? ??  0x00007fa638407e14: inc    %ebp               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
           ?? ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
           ?? ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.05%    ?? ??  0x00007fa638407e16: cmp    0x14(%rsp),%ebp
  0.00%    ?? ??  0x00007fa638407e1a: jge    0x00007fa638407d87  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
           ?  ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
           ?  ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.05%    ?  ??  0x00007fa638407e20: mov    0x10(%rsp),%edi
  4.24%    ?  ??  0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
           ?  ??                                                ; - java.lang.String::isLatin1@7 (line 3266)
           ?  ??                                                ; - java.lang.String::charAt@1 (line 692)
           ?  ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           ?  ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.86%    ?  ??  0x00007fa638407e2a: mov    0xc(%r12,%rdi,8),%r11d  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
           ?  ??                                                ; - java.lang.String::charAt@8 (line 693)
           ?  ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           ?  ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  1.67%    ?  ??  0x00007fa638407e2f: mov    0x60(%rsp),%rax
  1.71%    ?  ??  0x00007fa638407e34: mov    0x70(%rsp),%r9
  3.92%    ?  ??  0x00007fa638407e39: mov    (%rsp),%r10
  0.20%    ?  ??  0x00007fa638407e3d: mov    0x8(%rsp),%rbx
  1.44%    ?  ??  0x00007fa638407e42: mov    0x14(%rsp),%r8d
  1.70%    ?  ??  0x00007fa638407e47: mov    0x18(%rsp),%rsi    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
           ?  ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
           ?  ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.93%    ?  ??  0x00007fa638407e4c: movslq %ebp,%r13          ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
              ??                                                ; - java.lang.StringUTF16::charAt@7 (line 1268)
              ??                                                ; - java.lang.String::charAt@21 (line 695)
              ??                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
              ??                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.23%       ??  0x00007fa638407e4f: test   %ecx,%ecx
  0.00%      ???  0x00007fa638407e51: jne    0x00007fa638407e6b  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
             ???                                                ; - java.lang.String::charAt@4 (line 692)
             ???                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             ???                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             ???  0x00007fa638407e53: mov    0xc(%r12,%r11,8),%edx  ; implicit exception: dispatches to 0x00007fa638407fbc
             ???  0x00007fa638407e58: cmp    %edx,%ebp
             ???  0x00007fa638407e5a: jae    0x00007fa638407eb0
             ???  0x00007fa638407e5c: shl    $0x3,%r11
             ???  0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx  ;*iand {reexecute=0 rethrow=0 return_oop=0}
             ???                                                ; - java.lang.StringLatin1::charAt@25 (line 49)
             ???                                                ; - java.lang.String::charAt@12 (line 693)
             ???                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             ???                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             ???  0x00007fa638407e66: jmpq   0x00007fa638407dea
  1.52%      ? ?  0x00007fa638407e6b: mov    0xc(%r12,%r11,8),%ecx  ; implicit exception: dispatches to 0x00007fa638407fb0
  5.99%        ?  0x00007fa638407e70: sar    %ecx               ;*ishr {reexecute=0 rethrow=0 return_oop=0}
               ?                                                ; - java.lang.StringUTF16::length@3 (line 74)
               ?                                                ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
               ?                                                ; - java.lang.StringUTF16::charAt@2 (line 1267)
               ?                                                ; - java.lang.String::charAt@21 (line 695)
               ?                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
               ?                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  5.51%        ?  0x00007fa638407e72: cmp    %ecx,%ebp
  0.01%        ?  0x00007fa638407e74: jb     0x00007fa638407de0  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}

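It looks like the hottest spots are iinc (the increment of the loop index) and, rather counterintuitively to me, the ishr arithmetic shift in StringUTF16.length().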
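For orientation, the call chain named in the perfasm annotations (String::isLatin1, StringLatin1::charAt, StringUTF16::checkIndex and StringUTF16::length) can be modelled roughly as below. This is a simplified sketch and not the JDK source: the class name is hypothetical, and the big-endian byte order is an assumption, since the JDK uses the platform's byte order.

```java
import java.nio.charset.StandardCharsets;

public class CompactStringModel {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value; // one byte per char (LATIN1) or two bytes per char (UTF16)
    final byte coder;

    CompactStringModel(String s) {
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        this.coder = latin1 ? LATIN1 : UTF16;
        this.value = latin1
            ? s.getBytes(StandardCharsets.ISO_8859_1)
            : s.getBytes(StandardCharsets.UTF_16BE); // assumption: big-endian layout
    }

    int length() {
        return value.length >> coder; // the shift shows up as "ishr" for UTF16
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            if (index < 0 || index >= value.length) {
                throw new StringIndexOutOfBoundsException(index);
            }
            return (char) (value[index] & 0xFF); // the "iand" in the listing
        }
        int length = value.length >> 1; // recomputed on every call
        if (index < 0 || index >= length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

In this model every charAt() call reads both coder and value and redoes the bounds check. That corresponds to the getfield coder, getfield value and ishr entries that appear inside the loop body in the listing above.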

Using the perfnorm profiler as well, I found that toCharArray() needs fewer cycles, instructions and load misses than charAt():

Benchmark                          Mode  Cnt     Score   Error  Units

charAt:L1-dcache-loads             avgt       2104.816           #/op
charAt:L1-dcache-stores            avgt       1200.878           #/op
charAt:branches                    avgt        603.754           #/op
charAt:cycles                      avgt       1461.282           #/op
charAt:dTLB-loads                  avgt       2105.253           #/op
charAt:dTLB-stores                 avgt       1200.909           #/op
charAt:instructions                avgt       4716.775           #/op

toCharArray:L1-dcache-loads        avgt       1026.341           #/op
toCharArray:L1-dcache-stores       avgt        416.997           #/op
toCharArray:branches               avgt        419.265           #/op
toCharArray:cycles                 avgt        820.521           #/op
toCharArray:dTLB-loads             avgt       1026.506           #/op
toCharArray:dTLB-stores            avgt        417.591           #/op
toCharArray:instructions           avgt       2409.806           #/op

Can anyone help interpret this and explain such a significant difference?

Ser*_*nov 1

As @apangin mentioned in his comment:

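The problem is that Blackhole.consume is called inside the loop. Being a non-inlined black-box method, it prevents the code around the call from being optimized, in particular the caching of the String fields.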
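One way to keep the non-inlined call out of the loop is to accumulate into a local variable and hand over a single result per invocation. The sketch below uses plain static methods so that it runs without JMH; the class and method names are hypothetical, and this is not the code that the author or @apangin actually ran. In a JMH benchmark the same loop bodies would sit in @Benchmark methods that return the sum, since JMH consumes return values itself.

```java
public class CharIterationSketch {

    // charAt variant: no opaque call inside the loop, so the JIT is free
    // to keep the String's fields in registers across iterations.
    static int sumCharAt(String string) {
        int sum = 0;
        int length = string.length();
        for (int i = 0; i < length; i++) {
            sum += string.charAt(i);
        }
        return sum; // one result per invocation instead of one consume() per char
    }

    // toCharArray variant: iterates over a private copy of the characters.
    static int sumToCharArray(String string) {
        int sum = 0;
        for (char ch : string.toCharArray()) {
            sum += ch;
        }
        return sum;
    }

    public static void main(String[] args) {
        String sample = "abc";
        System.out.println(sumCharAt(sample));
        System.out.println(sumToCharArray(sample));
    }
}
```

Both methods compute the same value, so they stay comparable as benchmark bodies. Whether the gap between the two variants then shrinks or disappears has to be measured again; the sketch does not claim any particular timings.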