Why is there a performance difference between LongStream reduce and sum?

Nic*_*ick 7 java benchmarking java-8 java-stream

I am using LongStream's rangeClosed to test the performance of summing numbers. When I measure the performance with JMH, the results are as follows.

import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;
import java.util.stream.Stream;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 1, jvmArgs = {"-Xms4G", "-Xmx4G"})
@State(Scope.Benchmark)
@Warmup(iterations = 10, time = 10)
@Measurement(iterations = 10, time = 10)
public class ParallelStreamBenchmark {
  private static final long N = 10000000L;

  // Boxed Stream<Long> built with iterate, summed sequentially.
  @Benchmark
  public long sequentialSum() {
    return Stream.iterate(1L, i -> i + 1).limit(N).reduce(0L, Long::sum);
  }

  // Same boxed stream, summed in parallel.
  @Benchmark
  public long parallelSum() {
    return Stream.iterate(1L, i -> i + 1).limit(N).parallel().reduce(0L, Long::sum);
  }

  // Primitive LongStream, summed with an explicit reduce.
  @Benchmark
  public long rangedReduceSum() {
    return LongStream.rangeClosed(1, N).reduce(0, Long::sum);
  }

  // Primitive LongStream, summed with the built-in sum().
  @Benchmark
  public long rangedSum() {
    return LongStream.rangeClosed(1, N).sum();
  }

  @Benchmark
  public long parallelRangedReduceSum() {
    return LongStream.rangeClosed(1, N).parallel().reduce(0L, Long::sum);
  }

  @Benchmark
  public long parallelRangedSum() {
    return LongStream.rangeClosed(1, N).parallel().sum();
  }

  // Requests a GC after every single benchmark invocation.
  @TearDown(Level.Invocation)
  public void tearDown() {
    System.gc();
  }
}

The only difference between rangedReduceSum and rangedSum is that the latter uses the built-in sum() function. Why is there such a large performance difference?

Having verified that sum() ultimately calls reduce(0, Long::sum), isn't that the same as the reduce(0, Long::sum) used in the rangedReduceSum method?
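As a quick sanity check of that equivalence, here is a minimal sketch in plain Java, without JMH. The class name SumEquivalence and its helper methods are hypothetical names made up for this illustration. It only confirms that both calls return the same value, and that the value matches the closed form N(N+1)/2. It says nothing about timing.

```java
import java.util.stream.LongStream;

// Hypothetical helper class for illustration; not part of the benchmark.
public class SumEquivalence {
    // Sum 1..n with an explicit reduce, as in rangedReduceSum.
    public static long viaReduce(long n) {
        return LongStream.rangeClosed(1, n).reduce(0, Long::sum);
    }

    // Sum 1..n with the built-in sum(), as in rangedSum.
    public static long viaSum(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    // Closed-form value of 1 + 2 + ... + n, used as an independent check.
    public static long closedForm(long n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        long n = 10_000_000L; // same N as in the benchmark
        System.out.println(viaReduce(n) == viaSum(n));
        System.out.println(viaSum(n) == closedForm(n));
    }
}
```

Both comparisons should print true, so whatever explains the timing gap, it is not a difference in the computed result.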

tex*_*uce 1

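I ran the same benchmarks as the OP and could reproduce exactly the same results: the second benchmark is about 3 times slower. But when I changed the warmup to just 1 iteration, things got interesting: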

# Benchmark: test.ParallelStreamBenchmark.rangedReduceSum
# Warmup Iteration   1: 3.619 ms/op
Iteration   1: 3.931 ms/op
Iteration   2: 3.927 ms/op
Iteration   3: 3.834 ms/op
Iteration   4: 4.006 ms/op
Iteration   5: 4.605 ms/op
Iteration   6: 6.454 ms/op
Iteration   7: 6.466 ms/op
Iteration   8: 6.328 ms/op
Iteration   9: 6.370 ms/op
Iteration  10: 6.244 ms/op

# Benchmark: test.ParallelStreamBenchmark.rangedSum
# Warmup Iteration   1: 3.971 ms/op
Iteration   1: 4.034 ms/op
Iteration   2: 3.970 ms/op
Iteration   3: 3.957 ms/op
Iteration   4: 4.024 ms/op
Iteration   5: 4.278 ms/op
Iteration   6: 19.302 ms/op
Iteration   7: 19.132 ms/op
Iteration   8: 19.189 ms/op
Iteration   9: 18.842 ms/op
Iteration  10: 18.292 ms/op

Benchmark                                Mode  Cnt   Score    Error  Units
ParallelStreamBenchmark.rangedReduceSum  avgt   10   5.216 ±  1.871  ms/op
ParallelStreamBenchmark.rangedSum        avgt   10  11.502 ± 11.879  ms/op

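After the 5th iteration, both benchmarks slow down significantly. The second one, rangedSum, ends up about 3 times slower than the first (around 19 ms/op versus around 6.4 ms/op). If we count the warmup as iterations, it already makes sense that things start slowing down after 10 iterations. It looks like a bug in the benchmark library, which does not play well with GC. But as the warning says, the benchmark results in this case are for reference only.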
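To make the two phases explicit, the sketch below recomputes the averages from the per-iteration figures in the output above. The class name IterationSplit and the split at iteration 5 are my own choices for illustration. The numbers are copied from the run and the code does not measure anything itself.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; it only re-averages the
// ms/op values printed by the benchmark run above.
public class IterationSplit {
    public static final double[] RANGED_REDUCE_SUM = {
        3.931, 3.927, 3.834, 4.006, 4.605, 6.454, 6.466, 6.328, 6.370, 6.244
    };
    public static final double[] RANGED_SUM = {
        4.034, 3.970, 3.957, 4.024, 4.278, 19.302, 19.132, 19.189, 18.842, 18.292
    };

    // Arithmetic mean of xs[from..to), with 'to' exclusive.
    public static double mean(double[] xs, int from, int to) {
        return Arrays.stream(xs, from, to).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        report("rangedReduceSum", RANGED_REDUCE_SUM);
        report("rangedSum", RANGED_SUM);
    }

    private static void report(String name, double[] xs) {
        System.out.printf("%-16s first5=%.3f last5=%.3f all=%.3f ms/op%n",
            name, mean(xs, 0, 5), mean(xs, 5, 10), mean(xs, 0, xs.length));
    }
}
```

Over the first five iterations both benchmarks average about 4.05 ms/op. Over the last five they average about 6.37 and 18.95 ms/op, which is where the factor of roughly 3 comes from. The reported scores of 5.216 and 11.502 are the means over all ten iterations, so each blends two different regimes, which fits the very wide error intervals.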
