Nit*_*raj 9 java vectorization dot-product project-panama
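I am testing the OpenJDK Panama Vector API (jdk.incubator.vector) on an Amazon c5.4xlarge instance. In every case, a simple unrolled scalar dot product outperforms the Vector API code.
My question is: why am I unable to get the performance boost shown in Richard Startin's blog? The same performance improvement was also discussed by Intel engineers at this conference meetup. What am I missing?
JMH benchmark results: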
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.unrolled 1048576 thrpt 25 2440.726 ± 21.372 ops/s
FloatVector256DotProduct.vanilla 1048576 thrpt 25 807.721 ± 0.084 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 25 909.977 ± 6.542 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 25 887.422 ± 5.557 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 25 916.955 ± 4.652 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 25 877.569 ± 1.451 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 25 2096.782 ± 6.778 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 25 1627.320 ± 6.824 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 25 2102.654 ± 11.637 ops/s
AWS instance type: c5.4xlarge
CPU details:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3404.362
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
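To put the ops/s figures next to the cache sizes listed above, here is a rough sketch that converts a dot-product score into bytes read per second. With size = 1048576, the two float arrays together occupy 8 MiB, which is larger than the 1024K L2 but smaller than the 25344K L3. The class and method names are hypothetical, and the calculation assumes each benchmark invocation reads both arrays exactly once.

```java
final class WorkingSetSketch {
    // Bytes touched by one dot product over two float arrays of `size` elements.
    static long workingSetBytes(int size) {
        return 2L * size * Float.BYTES;
    }

    // Approximate read bandwidth in GB/s (10^9 bytes) implied by a JMH thrpt score.
    static double gigabytesPerSecond(int size, double opsPerSecond) {
        return workingSetBytes(size) * opsPerSecond / 1e9;
    }

    public static void main(String[] args) {
        int size = 1_048_576;
        System.out.printf("working set: %d MiB%n", workingSetBytes(size) >> 20);
        // Scores taken from the benchmark table in the question.
        System.out.printf("unrolled: %.1f GB/s%n", gigabytesPerSecond(size, 2440.726));
        System.out.printf("vector:   %.1f GB/s%n", gigabytesPerSecond(size, 909.977));
    }
}
```

This is only a sanity check on the scale of the numbers, not a claim about where the bottleneck is.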
Code snippets. See the full code in this GitHub repository.
JavaDocExample: this is the example shared in the Javadoc of OpenJDK's vectorIntrinsics branch.
@Benchmark
public void simpleMultiplyUnrolled() {
    for (int i = 0; i < size; i += 8) {
        c[i] = a[i] * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
        c[i + 4] = a[i + 4] * b[i + 4];
        c[i + 5] = a[i + 5] * b[i + 5];
        c[i + 6] = a[i + 6] * b[i + 6];
        c[i + 7] = a[i + 7] * b[i + 7];
    }
}

@Benchmark
public void simpleMultiply() {
    for (int i = 0; i < size; i++) {
        c[i] = a[i] * b[i];
    }
}

@Benchmark
public void vectorMultiply() {
    int i = 0;
    // It is assumed array arguments are of the same size
    for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        FloatVector vc = va.mul(vb);
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = a[i] * b[i];
    }
}
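For readers unfamiliar with the loop structure in vectorMultiply: SPECIES.loopBound(a.length) returns the largest multiple of the lane count that does not exceed the array length, so the first loop handles whole vectors and the second loop handles the leftover tail. A plain-Java sketch of the same shape, without the Vector API, might look like this. The class and method names are hypothetical, and the bit trick assumes the lane count is a power of two.

```java
final class LoopBoundSketch {
    // Largest multiple of `lanes` that is <= length.
    // Assumes length >= 0 and lanes is a power of two.
    static int loopBound(int length, int lanes) {
        return length & -lanes;
    }

    // Element-wise multiply in blocks of `lanes`, then a scalar tail loop.
    static void multiply(float[] a, float[] b, float[] c, int lanes) {
        int i = 0;
        int bound = loopBound(a.length, lanes);
        for (; i < bound; i += lanes) {
            // Stand-in for one vector operation over `lanes` elements.
            for (int lane = 0; lane < lanes; lane++) {
                c[i + lane] = a[i + lane] * b[i + lane];
            }
        }
        for (; i < a.length; i++) {
            c[i] = a[i] * b[i];
        }
    }
}
```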
FloatVector256DotProduct: this code is shamelessly copied from Richard Startin's blog. Thanks to Richard for the insightful blog.
@Benchmark
public float vectorfma() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
        var l = FloatVector.fromArray(F256, left, i);
        var r = FloatVector.fromArray(F256, right, i);
        sum = l.fma(r, sum);
    }
    return sum.reduceLanes(ADD);
}

@Benchmark
public float vectorfmaUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
        sum1 = FloatVector.fromArray(F256, left, i).fma(FloatVector.fromArray(F256, right, i), sum1);
        sum2 = FloatVector.fromArray(F256, left, i + width).fma(FloatVector.fromArray(F256, right, i + width), sum2);
        sum3 = FloatVector.fromArray(F256, left, i + width * 2).fma(FloatVector.fromArray(F256, right, i + width * 2), sum3);
        sum4 = FloatVector.fromArray(F256, left, i + width * 3).fma(FloatVector.fromArray(F256, right, i + width * 3), sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}

@Benchmark
public float vector() {
    var sum = FloatVector.zero(F256);
    for (int i = 0; i < size; i += F256.length()) {
        var l = FloatVector.fromArray(F256, left, i);
        var r = FloatVector.fromArray(F256, right, i);
        sum = l.mul(r).add(sum);
    }
    return sum.reduceLanes(ADD);
}

@Benchmark
public float vectorUnrolled() {
    var sum1 = FloatVector.zero(F256);
    var sum2 = FloatVector.zero(F256);
    var sum3 = FloatVector.zero(F256);
    var sum4 = FloatVector.zero(F256);
    int width = F256.length();
    for (int i = 0; i < size; i += width * 4) {
        sum1 = FloatVector.fromArray(F256, left, i).mul(FloatVector.fromArray(F256, right, i)).add(sum1);
        sum2 = FloatVector.fromArray(F256, left, i + width).mul(FloatVector.fromArray(F256, right, i + width)).add(sum2);
        sum3 = FloatVector.fromArray(F256, left, i + width * 2).mul(FloatVector.fromArray(F256, right, i + width * 2)).add(sum3);
        sum4 = FloatVector.fromArray(F256, left, i + width * 3).mul(FloatVector.fromArray(F256, right, i + width * 3)).add(sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}

@Benchmark
public float unrolled() {
    float s0 = 0f;
    float s1 = 0f;
    float s2 = 0f;
    float s3 = 0f;
    float s4 = 0f;
    float s5 = 0f;
    float s6 = 0f;
    float s7 = 0f;
    for (int i = 0; i < size; i += 8) {
        s0 = Math.fma(left[i + 0], right[i + 0], s0);
        s1 = Math.fma(left[i + 1], right[i + 1], s1);
        s2 = Math.fma(left[i + 2], right[i + 2], s2);
        s3 = Math.fma(left[i + 3], right[i + 3], s3);
        s4 = Math.fma(left[i + 4], right[i + 4], s4);
        s5 = Math.fma(left[i + 5], right[i + 5], s5);
        s6 = Math.fma(left[i + 6], right[i + 6], s6);
        s7 = Math.fma(left[i + 7], right[i + 7], s7);
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}

@Benchmark
public float vanilla() {
    float sum = 0f;
    for (int i = 0; i < size; ++i) {
        sum = Math.fma(left[i], right[i], sum);
    }
    return sum;
}
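Note that the unrolled loops above step by 8 elements (or by 4 * width for the vector variants) with no tail loop, so they rely on size being a multiple of the stride; 1048576 is. Assuming the arrays hold exactly size elements, any other size would index past the end. A hedged sketch of the same multi-accumulator idea with a tail loop added, in plain Java with hypothetical names and four accumulators for brevity:

```java
final class UnrolledDotSketch {
    // Dot product with four independent accumulators plus a scalar tail,
    // so the array length does not have to be a multiple of four.
    static float dot(float[] left, float[] right) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        int bound = left.length - (left.length % 4);
        for (; i < bound; i += 4) {
            // Each accumulator forms its own dependency chain.
            s0 = Math.fma(left[i], right[i], s0);
            s1 = Math.fma(left[i + 1], right[i + 1], s1);
            s2 = Math.fma(left[i + 2], right[i + 2], s2);
            s3 = Math.fma(left[i + 3], right[i + 3], s3);
        }
        float sum = (s0 + s1) + (s2 + s3);
        // Remaining 0..3 elements.
        for (; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }
}
```

Because the accumulators change the order in which floats are summed, the result can differ in the last bits from the single-accumulator vanilla version.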
Steps followed to build and use the OpenJDK Panama dev vectorIntrinsics branch, as shown in this SO question:
hg clone http://hg.openjdk.java.net/panama/dev/
cd dev/
hg checkout vectorIntrinsics
hg branch vectorIntrinsics
bash configure
make images
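As a sanity check that the benchmark is really running on the freshly built JDK with the incubator module available, one option is to query the boot module layer at startup. Incubator modules are not resolved by default and normally need --add-modules jdk.incubator.vector on the command line. A minimal sketch using only java.lang.ModuleLayer; the class and method names are hypothetical:

```java
final class ModuleCheckSketch {
    // True if the named module was resolved into the boot layer of this JVM.
    static boolean isModuleResolved(String moduleName) {
        return ModuleLayer.boot().findModule(moduleName).isPresent();
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("jdk.incubator.vector resolved: "
                + isModuleResolved("jdk.incubator.vector"));
    }
}
```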
Things I checked to see why it should be working.
小智 2
I found this write-up, in which @iwanowww answers the question: https://gist.github.com/iwanowww/221df8893fbaa4b6b0904e3036221b1d. In short, this was a regression that has since been fixed; the details follow.
TL;DR: it is fixed now.
(1) The regression in FloatVector256DotProduct.vector* on the latest vectorIntrinsics branch is caused by a bug in vector operation intrinsification:
2675 92 b net.codingdemon.vectorization.FloatVector256DotProduct::vector (75 bytes)
  ...
  @ 3 jdk.incubator.vector.FloatVector::zero (35 bytes) force inline by annotation
    @ 6 jdk.incubator.vector.FloatVector$FloatSpecies::vectorType (5 bytes) accessor
    @ 13 jdk.incubator.vector.AbstractSpecies::length (5 bytes) accessor
    @ 19 jdk.incubator.vector.FloatVector::toBits (6 bytes) force inline by annotation
      @ 1 java.lang.Float::floatToIntBits (15 bytes) (intrinsic)
    @ 23 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
      @ 4 java.lang.invoke.LambdaForm$MH/0x0000000800b8c040::invoke (8 bytes) force inline by annotation
    @ 28 jdk.internal.vm.vector.VectorSupport::broadcastCoerced (35 bytes) failed to inline (intrinsic)
The following patch fixes the bug:
diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -476,7 +476,7 @@

 // TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
 if (!arch_supports_vector(VectorNode::replicate_opcode(elem_bt), num_elem, elem_bt,
- is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/) {
+ (is_vector_mask(vbox_klass) ? VecMaskUseStore : VecMaskNotUsed), true /*has_scalar_args*/)) {
 if (C->print_intrinsics()) {
 tty->print_cr(" ** not supported: arity=0 op=broadcast vlen=%d etype=%s ismask=%d",
 num_elem, type2name(elem_bt),
Before:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 679.280 ± 13.731 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2319.770 ± 123.943 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 803.740 ± 42.596 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 797.153 ± 49.129 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 828.172 ± 16.936 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 798.037 ± 85.566 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1888.662 ± 55.922 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1486.322 ± 93.864 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1525.046 ± 110.700 ops/s
After:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vanilla 1048576 thrpt 5 666.581 ± 8.727 ops/s
FloatVector256DotProduct.unrolled 1048576 thrpt 5 2416.695 ± 106.223 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 5 3776.422 ± 117.357 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 5 3734.246 ± 122.463 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 5 1914.794 ± 51.329 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 5 1405.345 ± 52.025 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 5 1832.133 ± 56.256 ops/s
(2) The regression in vectorfmaUnrolled (compared to vectorfma) is caused by well-known inlining issues which break vector box elimination:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3804.485 ± 44.797 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 1158.018 ± 15.955 ops/s

19727 95 b net.codingdemon.vectorization.FloatVector256DotProduct::vectorfmaUnrolled (228 bytes)
  ...
  @ 209 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
    @ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
  @ 213 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
    @ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
  @ 218 jdk.incubator.vector.FloatVector::add (9 bytes) force inline by annotation
    @ 5 jdk.incubator.vector.FloatVector::lanewise (0 bytes) virtual call
  ...

Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 1048576 thrpt 5 3938.922 ± 97.041 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 1048576 thrpt 5 0.111 ± 0.003 B/op

FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 5 2052.549 ± 68.859 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 1048576 thrpt 5 1573537.127 ± 22.886 B/op
Until inlining is fixed, a warm-up phase on a smaller data input can help as a workaround:
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.vectorfma 128 thrpt 5 54838734.769 ± 161477.746 ops/s
FloatVector256DotProduct.vectorfma:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op

FloatVector256DotProduct.vectorfmaUnrolled 128 thrpt 5 68993637.658 ± 359974.720 ops/s
FloatVector256DotProduct.vectorfmaUnrolled:·gc.alloc.rate.norm 128 thrpt 5 ≈ 10⁻⁵ B/op
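One way to read the gc.alloc.rate.norm figures: at size 1048576, vectorfmaUnrolled allocates about 1.5 MB per invocation, while at size 128 both variants allocate essentially nothing. Dividing the large-size figure by the number of loop iterations gives a per-iteration cost. The sketch below does that arithmetic, assuming 8 float lanes per 256-bit vector and 4 vectors per iteration as in the benchmark source; the class and method names are hypothetical.

```java
final class AllocPerIterationSketch {
    // Bytes allocated per loop iteration, given JMH's normalized B/op figure.
    static double bytesPerIteration(double bytesPerOp, int size, int lanes, int unroll) {
        int iterations = size / (lanes * unroll);
        return bytesPerOp / iterations;
    }

    public static void main(String[] args) {
        // 1573537.127 B/op is the vectorfmaUnrolled figure reported at size 1048576.
        double perIteration = bytesPerIteration(1573537.127, 1_048_576, 8, 4);
        System.out.printf("about %.1f bytes allocated per loop iteration%n", perIteration);
    }
}
```

The result is roughly 48 bytes per iteration, which looks like a steady per-iteration allocation rather than a one-off cost. That would be consistent with vector values being boxed on the heap inside the loop when box elimination fails, though the exact object layout is not something the numbers alone establish.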