Asked by Wil*_*hen · Tags: c, java, performance, benchmarking, jvm
I am trying to optimize a small library that performs arithmetic operations on vectors.
To roughly check my progress, I decided to benchmark the performance of two popular vector arithmetic libraries written in two different languages: the GNU Scientific Library (GSL, C) and the Java OpenGL Math Library (JOML, JVM). I expected GSL, as a large project written in C and compiled ahead of time, to be significantly faster than JOML, which carries extra baggage from object management, method calls, and conforming to the Java specifications.
Surprisingly, JOML (JVM) instead ended up being around 4× faster than GSL (C). I wish to understand why this is the case.
The benchmark I performed was to compute 4,000,000 terms of Leibniz's formula for Pi, in chunks of 4 at a time via 4-dimensional vectors. The exact algorithm doesn't matter and doesn't have to make sense; it was just the first and simplest thing I thought of that would let me use multiple vector operations per iteration.
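For reference (my notation, not anything from either library), the series being chunked here is Leibniz's formula, with each loop iteration i evaluating the four terms k = 4i, 4i+1, 4i+2, 4i+3 as one 4-vector:

```latex
\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^{k}}{2k+1}
              = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots
```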
This is the C code in question:
```c
#include <stdio.h>
#include <time.h>
#include <gsl/gsl_vector.h>
#include <unistd.h>
#include <math.h>
#include <string.h>

#define IT 1000000

double pibench_inplace(int it) {
    gsl_vector* d = gsl_vector_calloc(4);
    gsl_vector* w = gsl_vector_calloc(4);
    for (int i = 0; i < 4; i++) {
        gsl_vector_set(d, i, (double)i*2 + 1);
        gsl_vector_set(w, i, (i % 2 == 0) ? 1 : -1);
    }
    gsl_vector* b = gsl_vector_calloc(4);
    double pi = 0.0;
    for (int i = 0; i < it; i++) {
        gsl_vector_memcpy(b, d);
        gsl_vector_add_constant(b, (double)i*8);
        for (int j = 0; j < 4; j++) {
            gsl_vector_set(b, j, pow(gsl_vector_get(b, j), -1.));
        }
        gsl_vector_mul(b, w);
        pi += gsl_vector_sum(b);
    }
    return pi*4;
}

double pibench_fast(int it) {
    double pi = 0;
    int eq_it = it * 4;
    for (int i = 0; i < eq_it; i++) {
        pi += (1 / ((double)i * 2 + 1) * ((i % 2 == 0) ? 1 : -1));
    }
    return pi*4;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Please specify a run mode.\n");
        return 1;
    }
    double pi = 0.0;
    struct timespec start = {0,0}, end = {0,0};
    clock_gettime(CLOCK_MONOTONIC, &start);
    if (strcmp(argv[1], "inplace") == 0) {
        pi = pibench_inplace(IT);
    } else if (strcmp(argv[1], "fast") == 0) {
        pi = pibench_fast(IT);
    } else {
        sleep(1);
        printf("Please specify a valid run mode.\n");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Pi: %f\n", pi);
    printf("Time: %f\n", ((double)end.tv_sec + 1.0e-9*end.tv_nsec) - ((double)start.tv_sec + 1.0e-9*start.tv_nsec));
    return 0;
}
```

This is how I built and ran the C code:
```shell
$ gcc GSL_pi.c -O3 -march=native -static $(gsl-config --cflags --libs) -o GSL_pi && ./GSL_pi inplace

Pi: 3.141592
Time: 0.061561
```

This is the JVM-platform code in question (written in Kotlin):
```kotlin
package joml_pi

import org.joml.Vector4d
import kotlin.time.measureTimedValue
import kotlin.time.DurationUnit


fun pibench(count: Int = 1000000): Double {
    val d = Vector4d(1.0, 3.0, 5.0, 7.0)
    val w = Vector4d(1.0, -1.0, 1.0, -1.0)
    val c = Vector4d(1.0, 1.0, 1.0, 1.0)
    val scratchpad = Vector4d()
    var pi = 0.0
    for (i in 0..count) {
        scratchpad.set(i * 8.0)
        scratchpad.add(d)
        c.div(scratchpad, scratchpad)
        scratchpad.mul(w)
        pi += scratchpad.x + scratchpad.y + scratchpad.z + scratchpad.w
    }
    return pi * 4.0
}

@kotlin.time.ExperimentalTime
fun <T> benchmark(func: () -> T, name: String = "", count: Int = 5) {
    val times = mutableListOf<Double>()
    val results = mutableListOf<T>()
    for (i in 0..count) {
        val result = measureTimedValue<T> { func() }
        results.add(result.value)
        times.add(result.duration.toDouble(DurationUnit.SECONDS))
    }
    println(listOf(
        "",
        name,
        "Results:",
        results.joinToString(", "),
        "Times:",
        times.joinToString(", ")
    ).joinToString("\n"))
}

@kotlin.time.ExperimentalTime
fun main(args: Array<String>) {
    benchmark<Double>(::pibench, "pibench")
}
```

This is how I built and ran the JVM-platform code:
```shell
$ kotlinc -classpath joml-1.10.5.jar JOML_pi.kt && kotlin -classpath joml-1.10.5.jar:. joml_pi/JOML_piKt.class

pibench
Results:
3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464
Times:
0.026850784, 0.014998012, 0.013095291, 0.012805373, 0.012977388, 0.012948186
```

There are multiple possibilities I have considered for why this operation, run in the JVM, is apparently several times faster than the equivalent C code. I do not find any of them particularly compelling:
- I didn't build the C code correctly — I compiled with -O3 and -march=native, statically linking against libraries from the Debian packages. In another case I even tried the -floop/-ftree parallelization flags. Either way, performance didn't really change.
- JOML is actually using native code — its README says it makes no JNI calls, I import directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call, so even if it had magical native code that shouldn't help anyway at such a granular level.

The only remaining explanation I can think of is that most of the time spent in C is overhead from doing function calls. This would seem consistent with the fact that implementations in C and Cython perform similarly. The performance advantage of the Java/Kotlin/JVM implementation would then come from its JIT being able to optimize away the function calls by effectively inlining everything in the loop. However, given the reputation of JIT compilers as being, at best, only slightly faster than native code under favourable conditions, that still seems like a massive speedup just from having a JIT.
I suppose if that is the case, then a follow-up question would be whether I could realistically or reliably expect these performance characteristics to carry over outside of a synthetic toy benchmark, in applications that may have much more scattered numeric calls rather than a single million-iteration loop.
First, a disclaimer: I am the author of JOML.
Now, you are probably not comparing apples to apples here. GSL is a general-purpose linear algebra library supporting many different linear algebra algorithms and data structures.
JOML, on the other hand, is not a general-purpose linear algebra library, but a specialized library covering only the use cases of computer graphics. It therefore contains only very specific classes for 2-, 3-, and 4-dimensional vectors and for 2x2, 3x3, and 4x4 (and non-square) matrices. In other words, even if you wanted to allocate a 5-dimensional vector, you could not do so with JOML.
As a consequence, all algorithms and data structures in JOML are explicitly designed around classes with x, y, z, and w fields. There are no loops involved. So adding two 4-dimensional vectors is literally just:
```java
dest.x = this.x + v.x;
dest.y = this.y + v.y;
dest.z = this.z + v.z;
dest.w = this.w + v.w;
```

There is not even any SIMD involved, because so far no JVM JIT can auto-vectorize over the different fields of a class. So a vector addition (or multiplication, or any lane-wise operation) today will produce exactly these scalar operations.
Next, you say:
> JOML is actually using native code — the README says it doesn't make JNI calls, I import directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call, so even if it had magical native code that shouldn't help anyway at such a granular level.
JOML itself does not define and use native code via the JNI interface. Of course, the operators and JRE methods that JOML uses internally will be intrinsified to native code, but not through JNI. Rather, all methods such as Math.fma() are intrinsified directly into their machine-code equivalents at JIT compilation time.
Now, as others have pointed out in the comments on your question: you are using a linked library (as opposed to a header-only library like GLM, which would probably be a better fit for your C/C++ code). So the C/C++ compiler likely cannot "see through" your call sites into the callees and apply optimizations based on static information at the call site (such as you calling gsl_vector_calloc with the argument 4). Every runtime check/branch on arguments that GSL performs must therefore still happen at runtime. This is very different from using a header-only library like GLM, where any half-decent C/C++ compiler would surely optimize everything away based on static knowledge of the calls/code. Then, I think, yes, an equivalent C/C++ program would beat the Java/Scala/Kotlin/JVM program in speed.
So comparing GSL and JOML is a bit like comparing the performance of Microsoft Excel evaluating the cell content = 1 + 2 against C code that prints the result with printf("%f\n", 1.0 + 2.0);. The former (Microsoft Excel; here, GSL) is far more versatile and general-purpose, while the latter (JOML) is highly specialized.
It just so happens that this specialization fits your exact use case, which is what makes it possible to use JOML for it at all.