Java 8u40 Math.round()非常慢

use*_*373 21 java jit jvm

我有一个用Java 8编写的相当简单的业余爱好项目,它在其一种操作模式中广泛使用重复的Math.round()调用.例如,一个这样的模式产生4个线程,并通过ExecutorService对48个可运行的任务进行排队,每个任务运行类似于下面的代码块2 ^ 31次:

int3 = Math.round(float1 + float2);
int3 = Math.round(float1 * float2);
int3 = Math.round(float1 / float2);
Run Code Online (Sandbox Code Playgroud)

这不完全是如何(涉及数组和嵌套循环),但你明白了.无论如何,在Java 8u40之前,类似于上述代码的代码可以在AMD A10-7700k上在大约13秒内完成大约1030亿个指令块的全部运行.使用Java 8u40,执行相同的操作大约需要260秒.代码没有变化,没有任何变化,只是Java更新.

有没有人注意到Math.round()变得慢得多,特别是当它被重复使用时?几乎就好像JVM正在进行某种优化之前它已经不再做了.也许它是在8u40之前使用SIMD而现在不是?

编辑:我已经完成了对MVCE的第二次尝试.你可以在这里下载第一次尝试:

https://www.dropbox.com/s/rm2ftcv8y6ye1bi/MathRoundMVCE.zip?dl=0

第二次尝试如下.我的第一次尝试已经从这篇文章中删除,因为它被认为太长了,并且很容易被JVM去掉代码去除(显然在8u40中发生的事情更少).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MathRoundMVCE
{           
    static long grandtotal = 0;
    static long sumtotal = 0;

    static float[] float4 = new float[128];
    static float[] float5 = new float[128];
    static int[] int6 = new int[128];
    static int[] int7 = new int[128];
    static int[] int8 = new int[128];
    static long[] longarray = new long[480];

    final static int mil = 1000000;

    public static void main(String[] args)
    {       
        initmainarrays();
        OmniCode omni = new OmniCode();
        grandtotal = omni.runloops() / mil;
        System.out.println("Total sum of operations is " + sumtotal);
        System.out.println("Total execution time is " + grandtotal + " milliseconds");
    }   

    public static long siftarray(long[] larray)
    {
        long topnum = 0;
        long tempnum = 0;
        for (short i = 0; i < larray.length; i++)
        {
            tempnum = larray[i];
            if (tempnum > 0)
            {
                topnum += tempnum;
            }
        }
        topnum = topnum / Runtime.getRuntime().availableProcessors();
        return topnum;
    }

    public static void initmainarrays()
    {
        int k = 0;

        do
        {           
            float4[k] = (float)(Math.random() * 12) + 1f;
            float5[k] = (float)(Math.random() * 12) + 1f;
            int6[k] = 0;

            k++;
        }
        while (k < 128);        
    }       
}

class OmniCode extends Thread
{           
    volatile long totaltime = 0;
    final int standard = 16777216;
    final int warmup = 200000;

    byte threads = 0;

    public long runloops()
    {
        this.setPriority(MIN_PRIORITY);

        threads = (byte)Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(threads);

        for (short j = 0; j < 48; j++)
        {           
            executor.execute(new RoundFloatToIntAlternate(warmup, (byte)j));
        }

        executor.shutdown();

        while (!executor.isTerminated())
        {
            try
            {
                Thread.sleep(100);
            } 
            catch (InterruptedException e)
            {
                //Do nothing                
            }
        }

        executor = Executors.newFixedThreadPool(threads);

        for (short j = 0; j < 48; j++)
        {           
            executor.execute(new RoundFloatToIntAlternate(standard, (byte)j));          
        }

        executor.shutdown();

        while (!executor.isTerminated())
        {
            try
            {
                Thread.sleep(100);
            } 
            catch (InterruptedException e)
            {
                //Do nothing                
            }
        }

        totaltime = MathRoundMVCE.siftarray(MathRoundMVCE.longarray);   

        executor = null;
        Runtime.getRuntime().gc();
        return totaltime;
    }
}

class RoundFloatToIntAlternate extends Thread
{       
    int i = 0;
    int j = 0;
    int int3 = 0;
    int iterations = 0;
    byte thread = 0;

    public RoundFloatToIntAlternate(int cycles, byte threadnumber)
    {
        iterations = cycles;
        thread = threadnumber;
    }

    public void run()
    {
        this.setPriority(9);
        MathRoundMVCE.longarray[this.thread] = 0;
        mainloop();
        blankloop();    

    }

    public void blankloop()
    {
        j = 0;
        long timer = 0;
        long totaltimer = 0;

        do
        {   
            timer = System.nanoTime();
            i = 0;

            do
            {
                i++;
            }
            while (i < 128);
            totaltimer += System.nanoTime() - timer;            

            j++;
        }
        while (j < iterations);         

        MathRoundMVCE.longarray[this.thread] -= totaltimer;
    }

    public void mainloop()
    {
        j = 0;
        long timer = 0; 
        long totaltimer = 0;
        long localsum = 0;

        int[] int6 = new int[128];
        int[] int7 = new int[128];
        int[] int8 = new int[128];

        do
        {   
            timer = System.nanoTime();
            i = 0;

            do
            {
                int6[i] = Math.round(MathRoundMVCE.float4[i] + MathRoundMVCE.float5[i]);
                int7[i] = Math.round(MathRoundMVCE.float4[i] * MathRoundMVCE.float5[i]);
                int8[i] = Math.round(MathRoundMVCE.float4[i] / MathRoundMVCE.float5[i]);

                i++;
            }
            while (i < 128);
            totaltimer += System.nanoTime() - timer;

            for(short z = 0; z < 128; z++)
            {
                localsum += int6[z] + int7[z] + int8[z];
            }       

            j++;
        }
        while (j < iterations);         

        MathRoundMVCE.longarray[this.thread] += totaltimer;
        MathRoundMVCE.sumtotal = localsum;
    }
}
Run Code Online (Sandbox Code Playgroud)

长话短说,这段代码在8u25和8u40中大致相同.如您所见,我现在将所有计算的结果记录到数组中,然后将循环的定时部分之外的那些数组求和为局部变量,然后将其写入外部循环末尾的静态变量.

8u25以下:总执行时间为261545毫秒

8u40以下:总执行时间为266890毫秒

测试条件与以前相同.因此,似乎8u25和8u31正在执行8u40停止执行的死代码删除,导致代码在8u40中"减速".这并没有解释每一个奇怪的小东西,但似乎是它的大部分.作为一个额外的奖励,这里提供的建议和答案给了我灵感,以改善我的业余爱好项目的其他部分,我非常感激.谢谢大家!

Iva*_*tov 15

休闲基准测试:您基准A,但实际测量B,并得出结论您已经测量C.

现代JVM过于复杂,需要进行各种优化.如果您尝试测量一小段代码,那么在没有非常详细了解JVM正在做什么的情况下正确执行它是非常复杂的.许多基准测试的罪魁祸首是消除死代码:编译器足够聪明,可以推断出一些冗余的计算,并完全消除它们.请阅读以下幻灯片http://shipilev.net/talks/jvmls-July2014-benchmarking.pdf.为了"修复"Adam的微基准标记(我仍然无法理解它的测量结果,这个"修复"没有考虑到预热,OSR和许多其他微观标记陷阱)我们必须将计算结果打印到系统输出:

    int result = 0;
    long t0 = System.currentTimeMillis();
    for (int i = 0; i < 1e9; i++) {
        result += Math.round((float) i / (float) (i + 1));
    }
    long t1 = System.currentTimeMillis();
    System.out.println("result = " + result);
    System.out.println(String.format("%s, Math.round(float), %.1f ms", System.getProperty("java.version"), (t1 - t0)/1f));
Run Code Online (Sandbox Code Playgroud)

结果是:

result = 999999999
1.8.0_25, Math.round(float), 5251.0 ms

result = 999999999
1.8.0_40, Math.round(float), 3903.0 ms
Run Code Online (Sandbox Code Playgroud)

原始MVCE示例的"修复"相同

It took 401772 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_40

It took 410767 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_25
Run Code Online (Sandbox Code Playgroud)

如果你想衡量Math#round的实际成本,你应该写这样的东西(基于jmh)

package org.openjdk.jmh.samples;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public class RoundBench {

    float[] floats;
    int i;

    @Setup
    public void initI() {
        Random random = new Random(0xDEAD_BEEF);
        floats = new float[8096];
        for (int i = 0; i < floats.length; i++) {
            floats[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float baseline() {
        i++;
        i = i & 0xFFFFFF00;
        return floats[i];
    }

    @Benchmark
    public int round() {
        i++;
        i = i & 0xFFFFFF00;
        return Math.round(floats[i]);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(RoundBench.class.getName())
                .build();
        new Runner(options).run();
    }
}
Run Code Online (Sandbox Code Playgroud)

我的结果是:

1.8.0_25
Benchmark            Mode  Cnt  Score   Error  Units
RoundBench.baseline  avgt    6  2.565 ± 0.028  ns/op
RoundBench.round     avgt    6  4.459 ± 0.065  ns/op

1.8.0_40 
Benchmark            Mode  Cnt  Score   Error  Units
RoundBench.baseline  avgt    6  2.589 ± 0.045  ns/op
RoundBench.round     avgt    6  4.588 ± 0.182  ns/op
Run Code Online (Sandbox Code Playgroud)

为了找到问题的根本原因,您可以使用https://github.com/AdoptOpenJDK/jitwatch/.为了节省时间,我可以说Math#round的JITted代码的大小在8.0_40中增加了.对于小方法来说,这几乎是不明显的,但是在大型方法的情况下,太长的机器代码会污染指令缓存.


Ada*_*dam 10

MVCE基于OP

  • 可能会进一步简化
  • 更改int3 =语句以int3 +=减少死代码删除的可能性.int3 =从8u31到8u40的差异是因子慢3倍.使用int3 +=差异仅慢15%.
  • 打印结果,以进一步减少死代码删除优化的机会

public class MathTime {
    static float[][] float1 = new float[8][16];
    static float[][] float2 = new float[8][16];

    public static void main(String[] args) {
        for (int j = 0; j < 8; j++) {
            for (int k = 0; k < 16; k++) {
                float1[j][k] = (float) (j + k);
                float2[j][k] = (float) (j + k);
            }
        }
        new Test().run();
    }

    private static class Test {
        int int3;

        public void run() {
            for (String test : new String[] { "warmup", "real" }) {

                long t0 = System.nanoTime();

                for (int count = 0; count < 1e7; count++) {
                    int i = count % 8;
                    int3 += Math.round(float1[i][0] + float2[i][0]);
                    int3 += Math.round(float1[i][1] + float2[i][1]);
                    int3 += Math.round(float1[i][2] + float2[i][2]);
                    int3 += Math.round(float1[i][3] + float2[i][3]);
                    int3 += Math.round(float1[i][4] + float2[i][4]);
                    int3 += Math.round(float1[i][5] + float2[i][5]);
                    int3 += Math.round(float1[i][6] + float2[i][6]);
                    int3 += Math.round(float1[i][7] + float2[i][7]);
                    int3 += Math.round(float1[i][8] + float2[i][8]);
                    int3 += Math.round(float1[i][9] + float2[i][9]);
                    int3 += Math.round(float1[i][10] + float2[i][10]);
                    int3 += Math.round(float1[i][11] + float2[i][11]);
                    int3 += Math.round(float1[i][12] + float2[i][12]);
                    int3 += Math.round(float1[i][13] + float2[i][13]);
                    int3 += Math.round(float1[i][14] + float2[i][14]);
                    int3 += Math.round(float1[i][15] + float2[i][15]);

                    int3 += Math.round(float1[i][0] * float2[i][0]);
                    int3 += Math.round(float1[i][1] * float2[i][1]);
                    int3 += Math.round(float1[i][2] * float2[i][2]);
                    int3 += Math.round(float1[i][3] * float2[i][3]);
                    int3 += Math.round(float1[i][4] * float2[i][4]);
                    int3 += Math.round(float1[i][5] * float2[i][5]);
                    int3 += Math.round(float1[i][6] * float2[i][6]);
                    int3 += Math.round(float1[i][7] * float2[i][7]);
                    int3 += Math.round(float1[i][8] * float2[i][8]);
                    int3 += Math.round(float1[i][9] * float2[i][9]);
                    int3 += Math.round(float1[i][10] * float2[i][10]);
                    int3 += Math.round(float1[i][11] * float2[i][11]);
                    int3 += Math.round(float1[i][12] * float2[i][12]);
                    int3 += Math.round(float1[i][13] * float2[i][13]);
                    int3 += Math.round(float1[i][14] * float2[i][14]);
                    int3 += Math.round(float1[i][15] * float2[i][15]);

                    int3 += Math.round(float1[i][0] / float2[i][0]);
                    int3 += Math.round(float1[i][1] / float2[i][1]);
                    int3 += Math.round(float1[i][2] / float2[i][2]);
                    int3 += Math.round(float1[i][3] / float2[i][3]);
                    int3 += Math.round(float1[i][4] / float2[i][4]);
                    int3 += Math.round(float1[i][5] / float2[i][5]);
                    int3 += Math.round(float1[i][6] / float2[i][6]);
                    int3 += Math.round(float1[i][7] / float2[i][7]);
                    int3 += Math.round(float1[i][8] / float2[i][8]);
                    int3 += Math.round(float1[i][9] / float2[i][9]);
                    int3 += Math.round(float1[i][10] / float2[i][10]);
                    int3 += Math.round(float1[i][11] / float2[i][11]);
                    int3 += Math.round(float1[i][12] / float2[i][12]);
                    int3 += Math.round(float1[i][13] / float2[i][13]);
                    int3 += Math.round(float1[i][14] / float2[i][14]);
                    int3 += Math.round(float1[i][15] / float2[i][15]);

                }
                long t1 = System.nanoTime();
                System.out.println(int3);
                System.out.println(String.format("%s, Math.round(float), %s, %.1f ms", System.getProperty("java.version"), test, (t1 - t0) / 1e6));
            }
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

结果

adam@brimstone:~$ ./jdk1.8.0_40/bin/javac MathTime.java;./jdk1.8.0_40/bin/java -cp . MathTime 
1.8.0_40, Math.round(float), warmup, 6846.4 ms
1.8.0_40, Math.round(float), real, 6058.6 ms
adam@brimstone:~$ ./jdk1.8.0_31/bin/javac MathTime.java;./jdk1.8.0_31/bin/java -cp . MathTime 
1.8.0_31, Math.round(float), warmup, 5717.9 ms
1.8.0_31, Math.round(float), real, 5282.7 ms
adam@brimstone:~$ ./jdk1.8.0_25/bin/javac MathTime.java;./jdk1.8.0_25/bin/java -cp . MathTime 
1.8.0_25, Math.round(float), warmup, 5702.4 ms
1.8.0_25, Math.round(float), real, 5262.2 ms
Run Code Online (Sandbox Code Playgroud)

意见

  • 对于Math.round(float)的简单使用,我发现我的平台(Linux x86_64)的性能没有差别.基准测试只有一个区别,我以前的天真和不正确的基准测试只暴露了优化行为的差异,因为Ivan的回答和Marco13的评论指出.
  • 8u40在死代码消除方面比以前的版本更具攻击性,这意味着在某些极端情况下会执行更多代码,因此速度更慢.
  • 8u40需要稍长时间来预热,但一旦"那里",就会更快.

来源分析

令人惊讶的是Math.round(float)是一个纯Java实现而不是本机,8u31和8u40的代码是相同的.

diff  jdk1.8.0_31/src/java/lang/Math.java jdk1.8.0_40/src/java/lang/Math.java
-no differences-

public static int round(float a) {
    int intBits = Float.floatToRawIntBits(a);
    int biasedExp = (intBits & FloatConsts.EXP_BIT_MASK)
            >> (FloatConsts.SIGNIFICAND_WIDTH - 1);
    int shift = (FloatConsts.SIGNIFICAND_WIDTH - 2
            + FloatConsts.EXP_BIAS) - biasedExp;
    if ((shift & -32) == 0) { // shift >= 0 && shift < 32
        // a is a finite number such that pow(2,-32) <= ulp(a) < 1
        int r = ((intBits & FloatConsts.SIGNIF_BIT_MASK)
                | (FloatConsts.SIGNIF_BIT_MASK + 1));
        if (intBits < 0) {
            r = -r;
        }
        // In the comments below each Java expression evaluates to the value
        // the corresponding mathematical expression:
        // (r) evaluates to a / ulp(a)
        // (r >> shift) evaluates to floor(a * 2)
        // ((r >> shift) + 1) evaluates to floor((a + 1/2) * 2)
        // (((r >> shift) + 1) >> 1) evaluates to floor(a + 1/2)
        return ((r >> shift) + 1) >> 1;
    } else {
        // a is either
        // - a finite number with abs(a) < exp(2,FloatConsts.SIGNIFICAND_WIDTH-32) < 1/2
        // - a finite number with ulp(a) >= 1 and hence a is a mathematical integer
        // - an infinity or NaN
        return (int) a;
    }
}
Run Code Online (Sandbox Code Playgroud)

  • 请在测试前预热JVM,用单独的方法提取测试代码......你的microbencmark中有很多陷阱.编写正确的微基准测试并不容易,请参阅[如何在Java中编写正确的微基准测试](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro -benchmark式的Java)? (4认同)
  • 你应该使用`System.nanoTime()`来进行基准测试,因为`System.currentTimeMillis()`不能保证单调递增.这可能不会影响你在这里的有趣结果,但仍然如此. (3认同)