Alt*_*ame 8 java performance simd vectorization java-17
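I have a question about the pow() function in Java 17's new Vector API. I am trying to implement the Black-Scholes formula in a vectorized manner, but I am having difficulty matching the performance of the scalar implementation.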

The code is as follows; here are some snippets:

    public static double[] createArray(int arrayLength)
    {
        double[] array0 = new double[arrayLength];
        for (int i = 0; i < arrayLength; i++)
        {
            array0[i] = 2.0;
        }
        return array0;
    }

    @Param({"256000"})
    int arraySize;
    public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
    DoubleVector vectorTwo = DoubleVector.broadcast(SPECIES, 2);
    DoubleVector vectorHundred = DoubleVector.broadcast(SPECIES, 100);

    double[] scalarTwo = new double[]{2, 2, 2, 2};
    double[] scalarHundred = new double[]{100, 100, 100, 100};

    @Setup
    public void Setup()
    {
        javaSIMD = new JavaSIMD();
        javaScalar = new JavaScalar();
        spotPrices = createArray(arraySize);
        timeToMaturity = createArray(arraySize);
        strikePrice = createArray(arraySize);
        interestRate = createArray(arraySize);
        volatility = createArray(arraySize);
        e = new double[arraySize];
        for (int i = 0; i < arraySize; i++)
        {
            e[i] = Math.exp(1);
        }
        upperBound = SPECIES.loopBound(spotPrices.length);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testVectorPerformance(Blackhole bh) {
        var upperBound = SPECIES.loopBound(spotPrices.length);
        for (var i = 0; i < upperBound; i += SPECIES.length())
        {
            bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices, timeToMaturity, strikePrice,
                    interestRate, volatility, e, i));
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testScalarPerformance(Blackhole bh) {
        for (int i = 0; i < arraySize; i++)
        {
            bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices, timeToMaturity, strikePrice,
                    interestRate, volatility, i, normDist));
        }
    }

    public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
            double[] interestRate, double[] volatility, double[] e, int i) {
    ...(skip lines)
        DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
    ...(skip lines)
        DoubleVector powerOperand = vRateScaled
                .mul(vTime)
                .neg();
        DoubleVector call = (vSpot
                .mul(CDFVectorizedExcelOptimized(d1, vE)))
                .sub(vStrike
                        .mul(vE
                                .pow(powerOperand))
                        .mul(CDFVectorizedExcelOptimized(d2, vE)));
        return call;

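For reference, the call expression in the snippet above appears to correspond to the standard Black-Scholes call price. This is only a reading of the visible code, assuming vRateScaled holds the interest rate r as a fraction, vTime the time to maturity T, vSpot and vStrike the spot price S and strike K, and CDFVectorizedExcelOptimized an approximation of the standard normal CDF N. The definitions of d1 and d2 sit in the elided lines and are not reproduced here.

```latex
C = S \, N(d_1) - K \, e^{-rT} \, N(d_2)
```

Read this way, the only use of pow() in the visible expression is the discount factor e^{-rT}, a power whose base is the constant e.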
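Below are some JMH benchmarks (2 forks, 2 warmups, 2 iterations) run on a Ryzen 5800X under WSL. Overall, the vectorized version appears to be roughly 2x slower than the scalar version. I also timed the method separately with a simple before/after timestamp, without JMH, and the results appear to be in line with these numbers.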

    Result "blackScholes.TestJavaPerf.testScalarPerformance":
      0.116 ±(99.9%) 0.002 ops/ms [Average]
      89873915287 cycles:u # 4.238 GHz (40.43%)
      242060738532 instructions:u # 2.69 insn per cycle

    Result "blackScholes.TestJavaPerf.testVectorPerformance":
      0.071 ±(99.9%) 0.001 ops/ms [Average]
      90878787665 cycles:u # 4.072 GHz (39.25%)
      254117779312 instructions:u # 2.80 insn per cycle

I also enabled JVM diagnostics with these options:

    "-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics", "-XX:+PrintAssembly"

With them enabled, I see the following:

      0x00007fe451959413: call 0x00007fe451239f00 ; ImmutableOopMap {rsi=Oop }
                                                  ;*synchronization entry
                                                  ; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
                                                  ; {runtime_call counter_overflow Runtime1 stub}
      0x00007fe451959418: jmp 0x00007fe4519593ce
      0x00007fe45195941a: movabs $0x7fe4519593ee,%r10 ; {internal_word}
      0x00007fe451959424: mov %r10,0x358(%r15)
      0x00007fe45195942b: jmp 0x00007fe451193100 ; {runtime_call SafepointBlob}
      0x00007fe451959430: nop
      0x00007fe451959431: nop
      0x00007fe451959432: mov 0x3d0(%r15),%rax
      0x00007fe451959439: movq $0x0,0x3d0(%r15)
      0x00007fe451959444: movq $0x0,0x3d8(%r15)
      0x00007fe45195944f: add $0x40,%rsp
      0x00007fe451959453: pop %rbp
      0x00007fe451959454: jmp 0x00007fe451231e80 ; {runtime_call unwind_exception Runtime1 stub}
      0x00007fe451959459: hlt
    <More halts cut off>
    [Exception Handler]
      0x00007fe451959460: call 0x00007fe451234580 ; {no_reloc}
      0x00007fe451959465: movabs $0x7fe46e76df9a,%rdi ; {external_word}
      0x00007fe45195946f: and $0xfffffffffffffff0,%rsp
      0x00007fe451959473: call 0x00007fe46e283d40 ; {runtime_call}
      0x00007fe451959478: hlt
    [Deopt Handler Code]
      0x00007fe451959479: movabs $0x7fe451959479,%r10 ; {section_word}
      0x00007fe451959483: push %r10
      0x00007fe451959485: jmp 0x00007fe4511923a0 ; {runtime_call DeoptimizationBlob}
      0x00007fe45195948a: hlt
    <More halts cut off>
    --------------------------------------------------------------------------------

    ============================= C2-compiled nmethod ==============================
      ** svml call failed for double_pow_32
                @ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
                @ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
                @ 2 java.lang.Math::pow (6 bytes) (intrinsic)

Investigation/Questions:

Note: I believe it is using 256-bit wide vectors (I checked this while debugging).

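This is probably related to JDK-8262275, "Math vector stubs are not called for double64 vectors":

> For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code.
>
> But we do have svml double64 vectors.
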
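You can try an alternative operation as a workaround. For example, instead of vE.pow(powerOperand), with vE being a vector of e, you can use powerOperand.lanewise(VectorOperators.EXP), which computes e^x for all lanes.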
Keep in mind that this API is still a work in progress in incubator state…
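
The workaround above rests on the identity that raising e to a power is the same as taking the exponential of that power. Below is a minimal scalar sketch of that identity using only java.lang.Math, so it runs without the --add-modules jdk.incubator.vector flag the Vector API needs. The class name, method names, and the sample rate and time values are hypothetical and chosen only for illustration; the vectorized form is the lanewise(VectorOperators.EXP) call mentioned above.

```java
public class ExpVersusPow {
    // Discount factor written the way the question does it: e raised to (-rate * time).
    static double discountWithPow(double rate, double time) {
        return Math.pow(Math.E, -rate * time);
    }

    // The same quantity computed with the dedicated exponential function.
    static double discountWithExp(double rate, double time) {
        return Math.exp(-rate * time);
    }

    public static void main(String[] args) {
        double rate = 0.02; // hypothetical interest rate, as a fraction
        double time = 2.0;  // hypothetical time to maturity, in years

        double viaPow = discountWithPow(rate, time);
        double viaExp = discountWithExp(rate, time);

        System.out.println("pow(e, x): " + viaPow);
        System.out.println("exp(x):    " + viaExp);
        System.out.println("abs diff:  " + Math.abs(viaPow - viaExp));
    }
}
```

The two results should agree to within floating-point rounding, which suggests the substitution does not change the computed price in any meaningful way. Whether it restores the expected speedup would still need to be measured with the JMH benchmark.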
\n