Dyl*_*ler 10 java sorting algorithm performance insertion-sort
我目前正在学习数据结构课程,正如您所料,我们要做的一件事就是编写一些常见的排序.在编写我的插入排序算法时,我发现它的运行速度明显快于我的教师的速度(对于400000个数据点,我的算法需要大约30秒,而他的大约需要90秒).我通过电子邮件向他发送了我的代码,当他们在同一台机器上运行时,会发生相同的结果.我们设法浪费了40多分钟,慢慢地将他的排序方法改为我的,直到它完全相同,一字不差,除了一个看似随意的事情.首先,这是我的插入排序代码:
public static int[] insertionSort(int[] A){
//Check for illegal cases
if (A == null || A.length == 0){
throw new IllegalArgumentException("A is not populated");
}
for(int i = 0; i < A.length; i++){
int j = i;
while(j > 0 && A[j - 1] > A[j]){
int temp = A[j];
A[j] = A[j - 1];
A[j - 1] = temp;
j--;
}
}
return A;
}
Run Code Online (Sandbox Code Playgroud)
现在,此时他的代码与我的代码完全相同,除了我们交换的行A[j]和A[j - 1].他的代码执行了以下操作:
int temp = A[j - 1];
A[j - 1] = A[j];
A[j] = temp;
Run Code Online (Sandbox Code Playgroud)
我们发现这3行是罪魁祸首.因此,我的代码运行速度明显加快.很困惑,我们跑去javap -c获取一个简单程序的字节代码,该程序只main包含一个数组声明,一个变量声明int j和用于交换的3行代码,当我编写它们并编写它们时.这是我的交换方法的字节代码:
Compiled from "me.java"
public class me {
public me();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: sipush 10000
3: newarray int
5: astore_1
6: bipush 10
8: istore_2
9: aload_1
10: iload_2
11: iaload
12: istore_3
13: aload_1
14: iload_2
15: aload_1
16: iload_2
17: iconst_1
18: isub
19: iaload
20: iastore
21: aload_1
22: iload_2
23: iconst_1
24: isub
25: iload_3
26: iastore
27: return
}
Run Code Online (Sandbox Code Playgroud)
我的讲师方法的字节码:
Compiled from "instructor.java"
public class instructor {
public instructor();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: sipush 10000
3: newarray int
5: astore_1
6: bipush 10
8: istore_2
9: aload_1
10: iload_2
11: iconst_1
12: isub
13: iaload
14: istore_3
15: aload_1
16: iload_2
17: iconst_1
18: isub
19: aload_1
20: iload_2
21: iaload
22: iastore
23: aload_1
24: iload_2
25: iload_3
26: iastore
27: return
}
Run Code Online (Sandbox Code Playgroud)
我没有看到这些字节码之间有任何真正的区别.什么可能导致这种奇怪的行为(我的代码仍然运行速度比他快3倍,并且预计这种差异会因为我们提供更大的算法而变得更加激烈)?这只是一个奇怪的Java怪癖.此外,这是否会发生在您的计算机上?作为参考,这是在2014年中期的MacBook Pro上运行的,我的代码与此处显示的完全相同,并且他的代码被推断为除了这3行之外的其他代码.
[编辑]这是我的测试类:
public class Tester1 {
public static void main(String[] args){
int[] A = new int[400000];
for(int i = 0; i < A.length; i++){
A[i] = (int) (Math.random() * Integer.MAX_VALUE);
}
double start = System.currentTimeMillis();
insertionSort(A);
System.out.println("My insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");
}
public static int[] insertionSort(int[] A){
//Check for illegal cases
if (A == null || A.length == 0){
throw new IllegalArgumentException("A is not populated");
}
for(int i = 0; i < A.length; i++){
int j = i;
while(j > 0 && A[j - 1] > A[j]){
int temp = A[j];
A[j] = A[j - 1];
A[j - 1] = temp;
j--;
}
}
return A;
}
}
Run Code Online (Sandbox Code Playgroud)
第二个文件:
public class Tester2 {
public static void main(String[] args){
int[] A = new int[400000];
for(int i = 0; i < A.length; i++){
A[i] = (int) (Math.random() * Integer.MAX_VALUE);
}
double start = System.currentTimeMillis();
otherInsertion(A);
System.out.println("Other insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");
}
public static int[] otherInsertion(int[] A){
//Check for illegal cases
if (A == null || A.length == 0){
throw new IllegalArgumentException("A is not populated");
}
for(int i = 0; i < A.length; i++){
int j = i;
while(j > 0 && A[j - 1] > A[j]){
int temp = A[j - 1];
A[j - 1] = A[j];
A[j] = temp;
j--;
}
}
return A;
}
}
Run Code Online (Sandbox Code Playgroud)
和输出(没有参数,只是java Tester1和java Tester2):
My insertion sort took 37680.0 milliseconds.
Other insertion sort took 86358.0 milliseconds.
Run Code Online (Sandbox Code Playgroud)
它们作为2个不同JVM中的2个独立文件运行.
这是循环展开优化以及常见子表达式消除的效果.根据阵列访问指令的顺序,JIT可以在一种情况下消除冗余负载,但在另一种情况下不能消除冗余负载.
让我详细解释一下.在这两种情况下,JIT都会展开内循环的4次迭代.
例如,你的情况:
while (j > 3) {
if (A[j - 1] > A[j]) {
int temp = A[j];
A[j] = A[j - 1];
A[j - 1] = temp; \
} A[j - 1] loaded immediately after store
if (A[j - 2] > A[j - 1]) { /
int temp = A[j - 1];
A[j - 1] = A[j - 2];
A[j - 2] = temp; \
} A[j - 2] loaded immediately after store
if (A[j - 3] > A[j - 2]) { /
int temp = A[j - 2];
A[j - 2] = A[j - 3];
A[j - 3] = temp; \
} A[j - 3] loaded immediately after store
if (A[j - 4] > A[j - 3]) { /
int temp = A[j - 3];
A[j - 3] = A[j - 4];
A[j - 4] = temp;
}
j -= 4;
}
Run Code Online (Sandbox Code Playgroud)
然后JIT消除了冗余阵列负载,并且得到的组件看起来像
0x0000000002d53a70: movslq %r11d,%r10
0x0000000002d53a73: lea 0x0(%rbp,%r10,4),%r10
0x0000000002d53a78: mov 0x10(%r10),%ebx ; ebx = A[j]
0x0000000002d53a7c: mov 0xc(%r10),%r9d ; r9d = A[j - 1]
0x0000000002d53a80: cmp %ebx,%r9d ; if (r9d > ebx) {
0x0000000002d53a83: jle 0x0000000002d539f3
0x0000000002d53a89: mov %r9d,0x10(%r10) ; A[j] = r9d
0x0000000002d53a8d: mov %ebx,0xc(%r10) ; A[j - 1] = ebx
; }
0x0000000002d53a91: mov 0x8(%r10),%r9d ; r9d = A[j - 2]
0x0000000002d53a95: cmp %ebx,%r9d ; if (r9d > ebx) {
0x0000000002d53a98: jle 0x0000000002d539f3
0x0000000002d53a9e: mov %r9d,0xc(%r10) ; A[j - 1] = r9d
0x0000000002d53aa2: mov %ebx,0x8(%r10) ; A[j - 2] = ebx
; }
0x0000000002d53aa6: mov 0x4(%r10),%r9d ; r9d = A[j - 3]
0x0000000002d53aaa: cmp %ebx,%r9d ; if (r9d > ebx) {
0x0000000002d53aad: jle 0x0000000002d539f3
0x0000000002d53ab3: mov %r9d,0x8(%r10) ; A[j - 2] = r9d
0x0000000002d53ab7: mov %ebx,0x4(%r10) ; A[j - 3] = ebx
; }
0x0000000002d53abb: mov (%r10),%r8d ; r8d = A[j - 4]
0x0000000002d53abe: cmp %ebx,%r8d ; if (r8d > ebx) {
0x0000000002d53ac1: jle 0x0000000002d539f3
0x0000000002d53ac7: mov %r8d,0x4(%r10) ; A[j - 3] = r8
0x0000000002d53acb: mov %ebx,(%r10) ; A[j - 4] = ebx
; }
0x0000000002d53ace: add $0xfffffffc,%r11d ; j -= 4
0x0000000002d53ad2: cmp $0x3,%r11d ; while (j > 3)
0x0000000002d53ad6: jg 0x0000000002d53a70
Run Code Online (Sandbox Code Playgroud)
循环展开后,教师的代码看起来会有所不同:
while (j > 3) {
if (A[j - 1] > A[j]) {
int temp = A[j - 1];
A[j - 1] = A[j];
A[j] = temp; <-- another store instruction between A[j - 1] access
}
if (A[j - 2] > A[j - 1]) {
int temp = A[j - 2];
A[j - 2] = A[j - 1];
A[j - 1] = temp;
}
...
Run Code Online (Sandbox Code Playgroud)
JVM不会消除后续的加载A[j - 1],因为在前一次加载之后还有另一个存储指令A[j - 1](尽管在这种特殊情况下,这种优化在理论上是可行的).
因此,汇编代码将有更多的加载指令,性能会更差:
0x0000000002b53a00: cmp %r8d,%r10d ; if (r10d > r8d) {
0x0000000002b53a03: jle 0x0000000002b53973
0x0000000002b53a09: mov %r8d,0xc(%rbx) ; A[j - 1] = r8d
0x0000000002b53a0d: mov %r10d,0x10(%rbx) ; A[j] = r10d
; }
0x0000000002b53a11: mov 0xc(%rbx),%r10d ; r10d = A[j - 1]
0x0000000002b53a15: mov 0x8(%rbx),%r9d ; r9d = A[j - 2]
0x0000000002b53a19: cmp %r10d,%r9d ; if (r9d > r10d) {
0x0000000002b53a1c: jle 0x0000000002b53973
0x0000000002b53a22: mov %r10d,0x8(%rbx) ; A[j - 2] = r10d
0x0000000002b53a26: mov %r9d,0xc(%rbx) ; A[j - 1] = r9d
; }
0x0000000002b53a2a: mov 0x8(%rbx),%r8d ; r8d = A[j - 2]
0x0000000002b53a2e: mov 0x4(%rbx),%r10d ; r10d = A[j - 3]
Run Code Online (Sandbox Code Playgroud)
注意,如果运行带有循环展开优化禁用(-XX:LoopUnrollLimit=0)的JVM ,则两种情况的性能都是相同的.
PS完全拆卸这两种方法都可以在这里获得,使用
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
| 归档时间: |
|
| 查看次数: |
204 次 |
| 最近记录: |