Why is this JOML (JVM) code so much faster than the equivalent GSL (C) code?

Asked by Wil*_*hen (score: 5) · tags: c, java, performance, benchmarking, jvm

I'm trying to optimize a small library for doing arithmetic on vectors.


To roughly check my progress, I decided to benchmark the performance of two popular vector arithmetic libraries written in two different languages: the GNU Scientific Library (GSL, C) and the Java OpenGL Math Library (JOML, JVM). I expected GSL, as a large project written in C and compiled ahead of time, to be significantly faster than JOML, which carries extra baggage from object management, method calls, and conforming to the Java specification.


Surprisingly, JOML (JVM) instead ended up being around 4X faster than GSL (C). I want to understand why this is the case.


The benchmark I performed was to compute 4,000,000 terms of Leibniz's formula for Pi, in chunks of 4 at a time via 4-dimensional vectors. The exact algorithm doesn't matter and doesn't have to make sense; this was just the first and simplest thing I thought of that would let me use multiple vector operations per iteration.
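For reference, the series being computed is Leibniz's formula, with four consecutive terms grouped per vector iteration:

```latex
\frac{\pi}{4} \;=\; \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
\;\approx\; \sum_{i=0}^{N-1}\left(\frac{1}{8i+1}-\frac{1}{8i+3}+\frac{1}{8i+5}-\frac{1}{8i+7}\right)
```

Each loop iteration i contributes the four terms with denominators 8i+1, 8i+3, 8i+5, and 8i+7, which is what the d = (1, 3, 5, 7) offset vector and the w = (1, -1, 1, -1) sign vector encode in both implementations below.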


This is the C code in question:

```c
#include <stdio.h>
#include <time.h>
#include <gsl/gsl_vector.h>
#include <unistd.h>
#include <math.h>
#include <string.h>

#define IT 1000000

double pibench_inplace(int it) {
    gsl_vector* d = gsl_vector_calloc(4);
    gsl_vector* w = gsl_vector_calloc(4);
    for (int i = 0; i < 4; i++) {
        gsl_vector_set(d, i, (double)i*2 + 1);
        gsl_vector_set(w, i, (i % 2 == 0) ? 1 : -1);
    }
    gsl_vector* b = gsl_vector_calloc(4);
    double pi = 0.0;
    for (int i = 0; i < it; i++) {
        gsl_vector_memcpy(b, d);
        gsl_vector_add_constant(b, (double)i*8);
        for (int j = 0; j < 4; j++) {  /* renamed from `i` to avoid shadowing the outer loop */
            gsl_vector_set(b, j, pow(gsl_vector_get(b, j), -1.));
        }
        gsl_vector_mul(b, w);
        pi += gsl_vector_sum(b);
    }
    return pi*4;
}

double pibench_fast(int it) {
    double pi = 0;
    int eq_it = it * 4;
    for (int i = 0; i < eq_it; i++) {
        pi += (1 / ((double)i*2 + 1) * ((i % 2 == 0) ? 1 : -1));
    }
    return pi*4;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Please specify a run mode.\n");
        return 1;
    }
    double pi = 0.0;  /* initialized so the invalid-mode path doesn't print garbage */
    struct timespec start = {0,0}, end = {0,0};
    clock_gettime(CLOCK_MONOTONIC, &start);
    if (strcmp(argv[1], "inplace") == 0) {
        pi = pibench_inplace(IT);
    } else if (strcmp(argv[1], "fast") == 0) {
        pi = pibench_fast(IT);
    } else {
        sleep(1);
        printf("Please specify a valid run mode.\n");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Pi: %f\n", pi);
    printf("Time: %f\n", ((double)end.tv_sec + 1.0e-9*end.tv_nsec) - ((double)start.tv_sec + 1.0e-9*start.tv_nsec));
    return 0;
}
```

This is how I built and ran the C code:

```
$ gcc GSL_pi.c -O3 -march=native -static $(gsl-config --cflags --libs) -o GSL_pi && ./GSL_pi inplace

Pi: 3.141592
Time: 0.061561
```

This is the JVM-platform code in question (written in Kotlin):

```kotlin
package joml_pi

import org.joml.Vector4d
import kotlin.time.measureTimedValue
import kotlin.time.DurationUnit


fun pibench(count: Int=1000000): Double {
    val d = Vector4d(1.0, 3.0, 5.0, 7.0)
    val w = Vector4d(1.0, -1.0, 1.0, -1.0)
    val c = Vector4d(1.0, 1.0, 1.0, 1.0)
    val scratchpad = Vector4d()
    var pi = 0.0
    for (i in 0..count) {
        scratchpad.set(i*8.0)
        scratchpad.add(d)
        c.div(scratchpad, scratchpad)
        scratchpad.mul(w)
        pi += scratchpad.x + scratchpad.y + scratchpad.z + scratchpad.w
    }
    return pi * 4.0
}

@kotlin.time.ExperimentalTime
fun <T> benchmark(func: () -> T, name: String="", count: Int=5) {
    val times = mutableListOf<Double>()
    val results = mutableListOf<T>()
    for (i in 0..count) {
        val result = measureTimedValue<T>( { func() } )
        results.add(result.value)
        times.add(result.duration.toDouble(DurationUnit.SECONDS))
    }
    println(listOf<String>(
            "",
            name,
            "Results:",
            results.joinToString(", "),
            "Times:",
            times.joinToString(", ")
    ).joinToString("\n"))
}

@kotlin.time.ExperimentalTime
fun main(args: Array<String>) {
    benchmark<Double>(::pibench, "pibench")
}
```

This is how I built and ran the JVM-platform code:

```
$ kotlinc -classpath joml-1.10.5.jar JOML_pi.kt && kotlin -classpath joml-1.10.5.jar:. joml_pi/JOML_piKt.class

pibench
Results:
3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464
Times:
0.026850784, 0.014998012, 0.013095291, 0.012805373, 0.012977388, 0.012948186
```

There are multiple possibilities I have considered for why this operation, run in the JVM, is apparently several times faster than the equivalent C code. I do not think any of them are particularly compelling:

  • I'm doing iteration counts that differ by an order of magnitude in the two languages. – It's possible I'm grossly misreading the code, but I'm pretty sure this isn't the case.
  • I've fudged up the algorithm and am doing vastly different things in each case. – Again, maybe I've misread it, but I don't think that's happening, and both cases do produce numerically correct results.
  • The timing mechanism I use for C introduces a lot of overhead. – I also tested simpler and no-op functions. They completed and were measured as expected in much less time.
  • The JVM code is parallelized across multiple processor cores. – With many more iterations, I watched my CPU use over a longer period and it did not exceed one core.
  • The JVM code makes better use of SIMD/vectorization. – I compiled the C with -O3 and -march=native, statically linking against libraries from Debian packages. In another case I even tried the -floop/-ftree parallelization flags. Either way performance didn't really change.
  • GSL has extra features that add overhead in this particular test. – I also have another version, with the vector class implemented and used through Cython, that does only the basics (iterating over a pointer) and performs roughly equivalently to GSL (with slightly more overhead, as expected). So that seems to be the limit for native code.
  • JOML is actually using native code. – The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked contains only .class files, and JNI adds ~20 Java ops of overhead to every call, so even if it had magical native code, that shouldn't help at such a granular level anyway.
  • The JVM has different specifics for floating-point arithmetic. – The JOML class I used accepts and returns doubles, just like the C code. In any case, having to emulate a specification that deviates from hardware capabilities probably shouldn't improve performance like this.
  • The exponential reciprocal step in my GSL code is less efficient than the division reciprocal step in my JOML code. – While commenting that out does reduce total execution time by around 25% (to ~0.045s), that still leaves a massive 3X gap with the JVM code (~0.015s).

The only remaining explanation I can think of is that most of the time spent in C is overhead from function calls. This would be consistent with the C and Cython implementations performing similarly. The performance advantage of the Java/Kotlin/JVM implementation would then come from its JIT optimizing away the function calls by effectively inlining everything in the loop. However, given that JIT compilers have a reputation of being at best only slightly faster than native code under favourable conditions, that still seems like a massive speedup to get just from having a JIT.


I suppose if that is the case, then a follow-up question would be whether I could realistically or reliably expect these performance characteristics to carry over outside of a synthetic toy benchmark, in applications that may have much more scattered numeric calls rather than a single million-iteration loop.


Answered by Kai*_*ack (score: 7)

First, a disclaimer: I am the author of JOML.


Now, you are probably not comparing apples to apples here. GSL is a general-purpose linear algebra library that supports many different linear algebra algorithms and data structures.


JOML, on the other hand, is not a general-purpose linear algebra library, but a specialized library covering the use cases of computer graphics, so it contains very concrete classes only for 2-, 3-, and 4-dimensional vectors and for 2x2, 3x3, and 4x4 (and non-square variants) matrices. In other words, you could not even allocate a 5-dimensional vector with JOML, even if you wanted to.


So, all algorithms and data structures in JOML are explicitly designed for classes with x, y, z, and w fields. There is no looping anywhere. A 4-dimensional vector addition is therefore literally just:

```java
dest.x = this.x + v.x;
dest.y = this.y + v.y;
dest.z = this.z + v.z;
dest.w = this.w + v.w;
```

There is not even any SIMD involved, because as of now no JVM JIT can auto-vectorize over the different fields of a class. So a vector addition (or multiplication, or any lane-wise operation) right now will produce exactly these scalar operations.


Next, you say:

> JOML is actually using native code. – The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked contains only .class files, and JNI adds ~20 Java ops of overhead to every call, so even if it had magical native code, that shouldn't help at such a granular level anyway.

JOML itself does not define and use native code through the JNI interface. Of course, the operators and JRE methods that JOML uses internally will be intrinsified to native code, but not through the JNI interface. Rather, all methods (such as Math.fma()) are intrinsified directly into their machine-code equivalents at JIT compilation time.


Now, as others have pointed out in the comments to your question: you are using a linked library (as opposed to a header-only library like GLM, which would probably be a better fit for your C/C++ code). So a C/C++ compiler likely cannot "see through" your call sites into the callees and apply optimizations based on the static information at the call site (such as you calling gsl_vector_calloc with an argument of 4). Therefore, every runtime check/branch on the arguments that GSL needs to perform must still happen at runtime. This would be very different with a header-only library like GLM, where any half-decent C/C++ compiler would surely optimize everything away based on static knowledge of the calls/code. Then, I believe, yes, the equivalent C/C++ program would beat the Java/Scala/Kotlin/JVM program in speed.


So, comparing GSL and JOML is a bit like comparing the performance of Microsoft Excel evaluating the cell content = 1 + 2 against C code that effectively writes printf("%f\n", 1.0 + 2.0);. The former (Microsoft Excel, here: GSL) is much more versatile and general, while the latter (JOML) is highly specialized.


It just so happens that this specialization fits your concrete use case exactly, making it even possible to use JOML for it.
