Why does Arrays.binarySearch not improve performance compared to walking the array?

Kor*_*gay 21 java arrays binary-search

I am trying to solve the Hackerland Radio Transmitters programming challenge.

In summary, the challenge is as follows:

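Hackerland is a one-dimensional city with n houses, where each house i is located at some x_i on the x-axis. The Mayor wants to install radio transmitters on the roofs of the city's houses. Each transmitter has a range, k, meaning it can transmit a signal to all houses ≤ k units of distance away.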

Given a map of Hackerland and the value of k, can you find the minimum number of transmitters needed to cover every house?

My implementation is as follows:

package biz.tugay;

import java.util.*;

public class HackerlandRadioTransmitters {

    public static int minNumOfTransmitters(int[] houseLocations, int transmitterRange) {
        // Sort and remove duplicates..
        houseLocations = uniqueHouseLocationsSorted(houseLocations);
        int towerCount = 0;
        for (int nextHouseNotCovered = 0; nextHouseNotCovered < houseLocations.length; ) {
            final int towerLocation = HackerlandRadioTransmitters.findNextTowerIndex(houseLocations, nextHouseNotCovered, transmitterRange);
            towerCount++;
            nextHouseNotCovered = HackerlandRadioTransmitters.nextHouseNotCoveredIndex(houseLocations, towerLocation, transmitterRange);
            if (nextHouseNotCovered == -1) {
                break;
            }
        }
        return towerCount;
    }

    public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
        final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
        final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
        int towerIndex = houseNotCoveredIndex;
        int loop = 0;
        while (true) {
            loop++;
            if (towerIndex == houseLocations.length - 1) {
                break;
            }
            if (farthestHouseLocationAllowed >= houseLocations[towerIndex + 1]) {
                towerIndex++;
                continue;
            }
            break;
        }
        System.out.println("findNextTowerIndex looped : " + loop);
        return towerIndex;
    }

    public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
        final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
        int notCoveredHouseIndex = towerIndex + 1;
        int loop = 0;
        while (notCoveredHouseIndex < houseLocations.length) {
            loop++;
            final int locationOfHouseBeingChecked = houseLocations[notCoveredHouseIndex];
            if (locationOfHouseBeingChecked > towerCoversUntil) {
                break; // Tower does not cover the house anymore, break the loop..
            }
            notCoveredHouseIndex++;
        }
        if (notCoveredHouseIndex == houseLocations.length) {
            notCoveredHouseIndex = -1;
        }
        System.out.println("nextHouseNotCoveredIndex looped : " + loop);
        return notCoveredHouseIndex;
    }

    public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
        Arrays.sort(houseLocations);
        final HashSet<Integer> integers = new HashSet<>();
        final int[] houseLocationsUnique = new int[houseLocations.length];

        int innerCounter = 0;
        for (int houseLocation : houseLocations) {
            if (integers.contains(houseLocation)) {
                continue;
            }
            houseLocationsUnique[innerCounter] = houseLocation;
            integers.add(houseLocationsUnique[innerCounter]);
            innerCounter++;
        }
        return Arrays.copyOf(houseLocationsUnique, innerCounter);
    }
}

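I am pretty sure this implementation is correct. But look at the details of the functions findNextTowerIndex and nextHouseNotCoveredIndex: they walk the array one element at a time!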

One of my tests is as follows:

static void test_01() throws FileNotFoundException {
    final long start = System.currentTimeMillis();
    final File file = new File("input.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
    final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381);
    assert minNumOfTransmitters == 1;
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}

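The file input.txt can be downloaded from here. (This is not the most important detail of the question, but still.) So we have an array of 73382 houses, and I deliberately set the transmitter range so that my methods loop a lot: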

Here is a sample output of this test on my machine:

findNextTowerIndex looped : 38213
nextHouseNotCoveredIndex looped : 13785
Took: 359 milliseconds..

I also have this test, which does not assert anything and only measures the time:

static void test_02() throws FileNotFoundException {
    final long start = System.currentTimeMillis();
    for (int i = 0; i < 400; i ++) {
        final File file = new File("input.txt");
        final Scanner scanner = new Scanner(file);
        int[] houseLocations = new int[73382];
        for (int counter = 0; counter < 73382; counter++) {
            houseLocations[counter] = scanner.nextInt();
        }
        final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);

        final int transmitterRange = ThreadLocalRandom.current().nextInt(1, 70000);
        final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
    }
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}

Here I create 400 random transmitter ranges and run the program 400 times. On my machine I get run times like the following:

Took: 20149 milliseconds..

So I thought, why not use binary search instead of walking the array, and changed my implementation as follows:

public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
    final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
    final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
    int nextTowerIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, farthestHouseLocationAllowed);

    if (nextTowerIndex < 0) {
        nextTowerIndex = -nextTowerIndex;
        nextTowerIndex = nextTowerIndex -2;
    }

    return nextTowerIndex;
}

public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
    final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
    int nextHouseNotCoveredIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, towerCoversUntil);

    if (-nextHouseNotCoveredIndex > houseLocations.length) {
        return -1;
    }

    if (nextHouseNotCoveredIndex < 0) {
        nextHouseNotCoveredIndex = - (nextHouseNotCoveredIndex + 1);
        return nextHouseNotCoveredIndex;
    }

    return nextHouseNotCoveredIndex + 1;
}
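
The index arithmetic above depends on the return convention of Arrays.binarySearch: a non-negative result is the index of the key, and a negative result encodes -(insertion point) - 1. Below is a minimal sketch of that convention. The class name BinarySearchConvention and the helper floorIndex are made up for illustration and are not part of the question's code; the sketch assumes a sorted array without duplicates.

```java
import java.util.Arrays;

class BinarySearchConvention {

    // Hypothetical helper: index of the last element <= key, or -1 if there is none.
    // Assumes 'sorted' is in ascending order and contains no duplicates.
    static int floorIndex(int[] sorted, int key) {
        int pos = Arrays.binarySearch(sorted, key);
        if (pos >= 0) {
            return pos;                 // exact match: the key itself is the last element <= key
        }
        int insertionPoint = -pos - 1;  // index of the first element greater than key
        return insertionPoint - 1;      // element just before the insertion point
    }

    public static void main(String[] args) {
        int[] houses = {2, 4, 6, 9};
        System.out.println(Arrays.binarySearch(houses, 6)); // 2  (found)
        System.out.println(Arrays.binarySearch(houses, 5)); // -3 (insertion point 2)
        System.out.println(floorIndex(houses, 5));          // 1
        System.out.println(floorIndex(houses, 10));         // 3
        System.out.println(floorIndex(houses, 1));          // -1
    }
}
```

The "negate, then subtract 2" step in findNextTowerIndex is the same computation as floorIndex written in two lines.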

I was expecting a good performance boost, since each search now loops at most O(log N) times instead of O(N). But test_01 outputs:

Took: 297 milliseconds..

Remember, it was Took: 359 milliseconds.. before. And for test_02:

Took: 18047 milliseconds..

So I consistently get values around 20 seconds for the array-walking implementation and 18 to 19 seconds for the binary search implementation.

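I was expecting a much larger gain from Arrays.binarySearch, but that is obviously not what I see. Why is that? What am I missing? Do I need an array with more than 73382 elements to see the benefit, or is the size irrelevant?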

Edit #01

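Following @huck_cussler's comment, I tried doubling and tripling my data set (with random numbers) and ran test_02 again (increasing the array sizes in the test itself accordingly, of course). For the linear implementation the times for the original, doubled and tripled data are: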

Took: 18789 milliseconds..
Took: 34396 milliseconds..
Took: 53504 milliseconds..

For the binary search implementation I get the following values:

Took: 18644 milliseconds..
Took: 33831 milliseconds..
Took: 52886 milliseconds..

pha*_*ers 15

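Your timing includes retrieving the data from the hard drive, which is probably taking up most of the run time. Leave the data load out of the timing to get a more accurate comparison of your two approaches. Suppose the load takes 18 seconds: you are then comparing 18.644 vs 18.789 (a 0.77% improvement) instead of 0.644 vs 0.789 (an 18.38% improvement).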
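
The two percentages can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the class and method names are invented for illustration, and the 18-second load time is the assumption made in the paragraph above.

```java
class RelativeImprovement {

    // Percentage by which 'after' improves on 'before' (both in seconds).
    static double improvementPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Timings with the assumed 18 s of file loading included.
        System.out.printf("with load:    %.2f%%%n", improvementPercent(18.789, 18.644));
        // The same timings with the 18 s of loading subtracted.
        System.out.printf("without load: %.2f%%%n", improvementPercent(0.789, 0.644));
    }
}
```

The absolute difference (0.145 s) is the same in both cases; only the denominator changes, which is why the fixed load cost hides the gain.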

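If you have a linear operation, O(n), such as loading a binary structure, and combine it with a binary search, O(log n), you end up with O(n). If you trust Big O notation, you should expect O(n + log n) not to differ significantly from O(2 * n), since both reduce to O(n).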

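Also, depending on the density of houses between towers, a binary search can perform better or worse than a linear search. Say there are 1024 homes, with a tower evenly spaced every 4 homes. A linear search will take 4 steps per tower, while a binary search will take about log2(1024) = 10 steps per tower.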
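
To make the density argument concrete, here is a small sketch that counts steps for both strategies on a synthetic street of 1024 evenly spaced houses. The class SearchStepCount, its methods and the choice of range 3 are assumptions for illustration only; the binary search here always scans the whole array, as the question's version does.

```java
class SearchStepCount {

    // Counts the neighbour comparisons a linear walk makes, starting at index
    // 'from', until it meets a value greater than 'limit' or runs off the array.
    static int linearSteps(int[] a, int from, int limit) {
        int steps = 0;
        int i = from;
        while (i + 1 < a.length) {
            steps++;
            if (a[i + 1] > limit) {
                break;
            }
            i++;
        }
        return steps;
    }

    // Counts loop iterations of a textbook binary search over the whole array.
    static int binarySteps(int[] a, int key) {
        int low = 0;
        int high = a.length - 1;
        int steps = 0;
        while (low <= high) {
            steps++;
            int mid = (low + high) >>> 1;
            if (a[mid] < key) {
                low = mid + 1;
            } else if (a[mid] > key) {
                high = mid - 1;
            } else {
                break; // key found
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        int[] houses = new int[1024];
        for (int i = 0; i < houses.length; i++) {
            houses[i] = i; // one house per unit of distance
        }
        int range = 3;     // assumed range: a tower reaches 3 houses to its right
        System.out.println("linear steps: " + linearSteps(houses, 0, houses[0] + range));
        System.out.println("binary steps: " + binarySteps(houses, houses[0] + range));
    }
}
```

With houses this dense relative to the range, the linear walk stops after a handful of comparisons, while the binary search still pays a cost that grows with log2 of the array length.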

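One more thing: your minNumOfTransmitters method re-sorts the already sorted array passed to it from test_01 and test_02. That re-sort takes longer than the searches themselves, which further obscures the timing difference between your two search algorithms.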

======

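I created a small timing class to get a better look at what is going on. I removed the line of code from minNumOfTransmitters that re-runs the sort, and added a boolean parameter to select whether to use the binary version. It sums the times over 400 iterations, separating out each step. The results on my system show that the load time dwarfs the sort time, which in turn dwarfs the solve time.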

  Load:  22.565s
  Sort:   4.518s
Linear:   0.012s
Binary:   0.003s

It is easy to see why optimizing the last step makes little difference to the overall run time.

private static class Timing {
    public long load=0;
    public long sort=0;
    public long solve1=0;
    public long solve2=0;
    private String secs(long millis) {
        return String.format("%3d.%03ds", millis/1000, millis%1000);
    }
    public String toString() {
        return "  Load: " + secs(load) + "\n  Sort: " + secs(sort) + "\nLinear: " + secs(solve1) + "\nBinary: " + secs(solve2);
    }
    public void add(Timing timing) {
        load+=timing.load;
        sort+=timing.sort;
        solve1+=timing.solve1;
        solve2+=timing.solve2;
    }
}

static Timing test_01() throws FileNotFoundException {
    Timing timing=new Timing();
    long start = System.currentTimeMillis();
    final File file = new File("c:\\path\\to\\xnpwdiG3.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    timing.load+=System.currentTimeMillis()-start;
    start=System.currentTimeMillis();
    final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
    timing.sort=System.currentTimeMillis()-start;
    start=System.currentTimeMillis();
    final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, false);
    timing.solve1=System.currentTimeMillis()-start;
    start=System.currentTimeMillis();
    final int minNumOfTransmittersBin = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, true);
    timing.solve2=System.currentTimeMillis()-start;
    final long end = System.currentTimeMillis();
    return timing;
}
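
The modified minNumOfTransmitters(houses, range, useBinary) called above is not shown in this answer. One possible shape for it is sketched below, under the assumption that the caller passes an array that is already sorted and free of duplicates. The class TransmitterSketch and the helper lastIndexWithin are hypothetical names, not the code that was actually timed.

```java
import java.util.Arrays;

class TransmitterSketch {

    // Index of the last house at or after 'from' whose location is <= limit.
    // Assumes a[from] <= limit, which holds whenever the range is non-negative.
    static int lastIndexWithin(int[] a, int from, int limit, boolean useBinary) {
        if (useBinary) {
            int pos = Arrays.binarySearch(a, from, a.length, limit);
            return pos >= 0 ? pos : -pos - 2; // insertion point minus one
        }
        int i = from;
        while (i + 1 < a.length && a[i + 1] <= limit) {
            i++;
        }
        return i;
    }

    // No sorting here: 'sortedUnique' must already be sorted and de-duplicated.
    static int minNumOfTransmitters(int[] sortedUnique, int range, boolean useBinary) {
        int towers = 0;
        int next = 0; // index of the first house that is not covered yet
        while (next < sortedUnique.length) {
            // Put the tower on the farthest house that still reaches house 'next'.
            int tower = lastIndexWithin(sortedUnique, next, sortedUnique[next] + range, useBinary);
            // Skip every house to the right that this tower covers.
            int lastCovered = lastIndexWithin(sortedUnique, tower, sortedUnique[tower] + range, useBinary);
            towers++;
            next = lastCovered + 1;
        }
        return towers;
    }

    public static void main(String[] args) {
        int[] houses = {2, 4, 5, 6, 7, 9, 11, 12};
        System.out.println(minNumOfTransmitters(houses, 2, false)); // linear walk
        System.out.println(minNumOfTransmitters(houses, 2, true));  // binary search
    }
}
```

Both modes run the same greedy algorithm and differ only in how the two boundary indices are located, so they should always return the same count.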


Dor*_*old 7

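Your time measurement includes operations that are much slower than the array search, namely file system I/O and array sorting. I/O in general (reading from or writing to the file system, network communication) is orders of magnitude slower than operations that involve only the CPU and RAM access.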

Let's rewrite your test so that it does not read the file on every loop iteration:

static void test_02() throws FileNotFoundException {
    final File file = new File("input.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    scanner.close();
    final int rounds = 400;
    final int[] uniqueHouseLocationsSorted = uniqueHouseLocationsSorted(houseLocations);
    final int transmitterRange = 73381;
    final long start = System.currentTimeMillis();
    for (int i = 0; i < rounds; i++) {
        final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
    }
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}

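Note that in this version of the test the file is read only once, and the timing starts after that. With the above I get Took: 1700 milliseconds.. (give or take a few milliseconds) for both the iterative version and the binary search version. So we still cannot see that binary search is faster. That is because almost all of the time is spent sorting the array 400 times.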

Now let's remove the line that sorts the input array from the minNumOfTransmitters method. The array is already sorted (once) at the beginning of the test.

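Now things are much faster. With the line houseLocations = uniqueHouseLocationsSorted(houseLocations) removed from minNumOfTransmitters, I get Took: 68 milliseconds.. for the iterative version. Since this duration is already very small, we will not see a significant difference with the binary search version.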

So let's increase the number of rounds to 100000.
Now I get Took: 2121 milliseconds.. for the iterative version and Took: 36 milliseconds.. for the binary search version.

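Because we have now isolated what we measure and focus on the array search, rather than including much slower operations, we can see the big difference in performance in favor of binary search.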
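
For readers without the input file, the same isolation can be approximated with synthetic in-memory data. The sketch below is a naive timing loop, not a rigorous benchmark: the class SearchTimingSketch, the generated house locations and the number of rounds are all assumptions for illustration, and the JIT can still distort results (a harness such as JMH is the usual choice for serious measurements).

```java
import java.util.Arrays;

class SearchTimingSketch {

    // Linear walk to the last index whose value is <= limit (assumes a[0] <= limit).
    static int linearLast(int[] a, int limit) {
        int i = 0;
        while (i + 1 < a.length && a[i + 1] <= limit) {
            i++;
        }
        return i;
    }

    // Same result via Arrays.binarySearch (assumes sorted, duplicate-free input).
    static int binaryLast(int[] a, int limit) {
        int pos = Arrays.binarySearch(a, limit);
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        int[] houses = new int[73382];
        for (int i = 0; i < houses.length; i++) {
            houses[i] = i * 2;              // synthetic, already sorted and unique
        }
        int limit = houses[houses.length - 1]; // forces the linear walk over the whole array
        int rounds = 2_000;
        long sink = 0;                      // consume results so the work is not optimized away

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            sink += linearLast(houses, limit);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            sink += binaryLast(houses, limit);
        }
        long t2 = System.nanoTime();

        System.out.println("linear: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("binary: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sink);
    }
}
```

Nothing is read from disk or sorted inside the timed regions, so whatever difference shows up comes from the searches alone.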

If you want to see how many times the binary search enters its while loop, you can implement it yourself and add a counter:

private static int binarySearch0(int[] a, int fromIndex, int toIndex, int key) {
    int low = fromIndex;
    int high = toIndex - 1;
    int loop = 0;
    while (low <= high) {
        loop++;
        int mid = (low + high) >>> 1;
        int midVal = a[mid];

        if (midVal < key) {
            low = mid + 1;
        } else if (midVal > key) {
            high = mid - 1;
        } else {
            return mid; // key found
        }
    }
    System.out.println("binary search looped " + loop + " times");
    return -(low + 1);  // key not found.
}

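The method is copied from the Arrays class in the JDK - I only added the loop counter and the println.
When the array to search has length 73382, the loop is entered only 16 times. That is exactly what we expect: log2(73382) ≈ 16.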
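
As a rough cross-check of that figure, the logarithm and the worst-case iteration count can be computed directly. The sketch assumes the standard bound of floor(log2(n)) + 1 loop iterations for a binary search over n elements; the class and method names are invented for illustration.

```java
class BinarySearchBound {

    // Base-2 logarithm of n, computed from natural logarithms.
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }

    // Worst-case number of loop iterations for n >= 1 elements:
    // floor(log2(n)) + 1, which equals the bit length of n.
    static int maxIterations(int n) {
        return 32 - Integer.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        int n = 73382;
        System.out.println("log2(n)        = " + log2(n));
        System.out.println("max iterations = " + maxIterations(n));
    }
}
```

For n = 73382 the bound is 17, so an observed count of 16 for a particular key is consistent with it; the exact count depends on where the key falls in the array.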