chi*_*ity 48 java performance file-io
Suppose you have a large ASCII text file, with a random non-negative integer on each line, each in the range from 0 to 1,000,000,000. There are 100,000,000 lines in the file. What's the fastest way to read through the file and calculate the sum of all the integers?
Constraint: we've got 10MB of RAM to work with. The file is 1GB in size, so we don't want to read the whole thing in and then process it.
Here are various solutions I've tried. I found the results rather surprising.
Is there anything faster that I've missed?
Please note: all timings given below are for running the algorithm 10 times in total (run once and discard; start timer; run 10 times; stop timer). The machine is a fairly slow Core 2 Duo.
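The timing protocol described above (one discarded warm-up run, then ten timed runs) can be sketched as a small harness. This is only an illustration: the class name `TimingDemo` and the method `timeTenRuns` are made up here, and the trivial loop stands in for whichever summing method is being measured.

```java
class TimingDemo {
    // Runs the task once untimed, then returns the elapsed nanoseconds for 10 timed runs.
    static long timeTenRuns(Runnable task) {
        task.run();                          // run once and discard
        long start = System.nanoTime();      // start timer
        for (int i = 0; i < 10; i++) {
            task.run();                      // run 10 times
        }
        return System.nanoTime() - start;    // stop timer
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption); swap in the method under test.
        long nanos = timeTenRuns(() -> {
            long s = 0;
            for (int i = 0; i < 1_000_000; i++) {
                s += i;
            }
        });
        System.out.printf("10 runs took %.3f s%n", nanos / 1e9);
    }
}
```

Discarding the first run keeps one-off costs such as class loading and initial JIT compilation out of the measurement.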
The first thing to try is the obvious approach:
private long sumLineByLine() throws NumberFormatException, IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }
    br.close();
    return total;
}
Note that the maximum possible return value is 10^17, which still fits easily in a long, so we don't have to worry about overflow.
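As a quick sanity check of that bound, the worst case can be computed directly. This is a small sketch; the class name `OverflowCheck` is illustrative. `Math.multiplyExact` throws an ArithmeticException if the product does not fit in a long.

```java
class OverflowCheck {
    // Worst case: 100,000,000 lines, each holding at most 1,000,000,000.
    static long worstCaseTotal() {
        return Math.multiplyExact(100_000_000L, 1_000_000_000L);
    }

    public static void main(String[] args) {
        long worst = worstCaseTotal();
        System.out.println("worst-case total: " + worst);
        // How many times the worst case fits into the range of a long.
        System.out.println("headroom factor:  " + Long.MAX_VALUE / worst);
    }
}
```

An int accumulator, by contrast, could overflow after only a handful of lines, since three values near 1,000,000,000 already exceed Integer.MAX_VALUE.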
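On my machine, running this 11 times and discounting the first run takes around 92.9 seconds.
Inspired by a comment on this question, I tried not creating a new int k to store the result of parsing the line, and instead adding the parsed value directly to total. So this: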
    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }
becomes this:
    while ((line = br.readLine()) != null)
        total += Integer.parseInt(line);
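I was certain that this wouldn't make any difference, and thought it highly likely that the compiler would generate the same bytecode for the two versions. But, to my surprise, it did shave a little time off: we're down to 92.1 seconds.
One thing that bothers me about the code so far is that we turn the String into an int, and then add it on at the end. Might it not be quicker to add on as we go? What happens if we parse the String ourselves? Something like this...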
private long sumLineByLineManualParse() throws NumberFormatException,
        IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        char chs[] = line.toCharArray();
        int mul = 1;
        for (int i = chs.length - 1; i >= 0; i--) {
            char c = chs[i];
            switch (c) {
            case '0':
                break;
            case '1':
                total += mul;
                break;
            case '2':
                total += (mul << 1);
                break;
            case '4':
                total += (mul << 2);
                break;
            case '8':
                total += (mul << 3);
                break;
            default:
                total += (mul*((byte) c - (byte) ('0')));   
            }
            mul*=10;
        }
    }
    br.close();
    return total;
}
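This, I thought, might save a bit of time, especially with some bitshift optimisations for the multiplication. But the overhead of converting to a character array must be swamping any gains: this now takes 148.2 seconds.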
One last thing we can try is to process the file as binary data.
Parsing an integer from the front is awkward if you don't know the length of it. Parsing it backwards is much easier: the first digit you encounter is units, the next one is tens, and so on. So the easiest way to approach the whole thing is to read the file backwards.
If we allocate a byte[] buffer of (say) 8MB, we can fill it up with the last 8MB of the file, process it, then read the preceding 8MB, and so on. We need to be a little careful that we don't screw up a number that we're in the middle of parsing when we move to the next block, but that's the only problem.
When we encounter a digit, we add it (suitably multiplied according to its position in the numeral) to the total, and then multiply the coefficient by 10 so we're ready for the next digit. If we encounter anything that isn't a digit (a CR or LF), we just reset the coefficient.
private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[8*1024*1024];
    int mul = 1;
    long total = 0;
    while (lastRead>0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead-len);
        raf.readFully(buf, 0, len);
        lastRead-=len;
        for (int i=len-1; i>=0; i--) {
            //48 is '0' and 57 is '9'
            if ((buf[i]>=48) && (buf[i]<=57)) {
                total+=mul*(buf[i]-48);
                mul*=10;
            } else
                mul=1;
        }
    }
    raf.close();
    return total;
}
This runs in 30.8 seconds! That's a speed increase by a factor of 3 over the previous best.
Why is this so much faster? Is it just a matter of avoiding creating a String? And all the worrying behind the scenes about character sets and the like?
Can we do any better than this by using a MappedByteBuffer to help? I have a feeling that the overheads of invoking methods to read from the buffer would slow things down, especially when reading backwards from the buffer.
First, an observation. It should have occurred to me before, but I think the reason for the inefficiency of the String-based reading is not so much the time taken to create all the String objects but the fact that they are so short-lived: we've got 100,000,000 of them for the garbage collector to deal with. That is bound to upset it.
Now some experiments based on answers/comments people have posted.
One suggestion was that since a BufferedReader uses a default buffer of 16KB, and I've used a buffer of 8MB, I'm not comparing like with like. It's bound to be faster if you use a bigger buffer.
Here's the shock. The sumBinary() method (Method 4) ran in 30.8 seconds yesterday with an 8MB buffer. Today, code unchanged, the wind direction has changed and we're at 30.4 seconds. If I drop the buffer size down to 16KB to see how much slower it gets, it gets faster! It now runs in 23.7 seconds. Crazy. Who saw that one coming?!
A bit of experimentation suggests that 16KB is about optimal. Perhaps the Java guys did the same experiments, and that's why they went with 16KB!
I wondered too whether the problem is I/O-bound. How much time is spent on disk access, and how much on number crunching? If it's almost all disk access, as suggested by a well-supported comment on one of the proposed answers, then we won't be able to make much improvement whatever we do.
This is easy to test by running the code with all the parsing and number crunching commented out, but with the reading still intact:
private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 1;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        /*for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57)) {
                total += mul * (buf[i] - 48);
                mul *= 10;
            } else
                mul = 1;
        }*/
    }
    raf.close();
    return total;
}
This now runs in 3.7 seconds! This doesn't look I/O-bound to me.
Of course, some of the I/O speed will come from disk cache hits. But that isn't really the point here: we're still taking 20 seconds of CPU time (also confirmed using Linux's time command), which is plenty big enough to try to reduce it.
I'd maintained in my original post that there was good reason to scan the file backwards rather than forwards. I didn't explain that very well. The idea was that if you scan a number forwards, you have to accumulate the total value of the scanned number, and then add it on. If you scan backwards, you can add it to the cumulative total as you go. My subconscious was making some sort of sense to itself (on which more later), but I'd missed one key point, which was pointed out in one of the answers: to scan backwards, I was doing two multiplications per iteration, but with scanning forwards you need only one. So I coded up a forward-scanning version:
private long sumBinaryForward() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int fileLength = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int acc = 0;
    long total = 0;
    int read = 0;
    while (read < fileLength) {
        int len = Math.min(buf.length, fileLength - read);
        raf.readFully(buf, 0, len);
        read += len;
        for (int i = 0; i < len; i++) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
        }
    }
    raf.close();
    // Add acc too, in case the last number has no trailing line break.
    return total + acc;
}
This runs in 20.0 seconds, beating the backward-scanning version by a distance. Nice.
What I realised during the night, though, was that although I was performing two multiplications per iteration, there was the possibility of using a cache to store these multiplications, so that I could avoid having to perform them during backwards iteration. I was pleased to see when I woke up that someone had had the same idea!
The point is that there are at most 10 digits in the numbers we're scanning, and only 10 possible values for each digit, so there are only 100 possibilities for the contribution of a digit to the cumulative total. We can precompute these, and then use them in the backward-scanning code. That ought to beat the forward-scanning version, because we've now got rid of the multiplications entirely. (Note that we can't do this with forward scanning, because the multiplication is of the accumulator, which could take any value up to 10^9. It's only in the backward case that both operands are limited to a few possibilities.)
private long sumBinaryCached() throws IOException {
    int mulCache[][] = new int[10][10];
    int coeff = 1;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++)
            mulCache[i][j] = coeff * j;
        coeff *= 10;
    }
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 0;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                total += mulCache[mul++][buf[i] - 48];
            else
                mul = 0;
        }
    }
    raf.close();
    return total;
}
This runs in 26.1 seconds. Disappointing, to say the least. Reading backwards is less efficient in terms of I/O, but we've seen that I/O is not the major headache here. I had expected this to make a big positive difference. Perhaps the array lookup is just as expensive as the multiplications we've replaced. (I did try making the array 16x16, and using bitshifts to index, but it didn't help.)
Looks like forward scanning is where it's at.
Next thing to add in is a MappedByteBuffer, to see if that's more efficient than using a raw RandomAccessFile. It doesn't need much change to the code.
private long sumBinaryForwardMap() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte buf[] = new byte[16 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    int acc = 0;
    long total = 0;
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        for (int i = 0; i < len; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
    }
    ch.close();
    raf.close();
    // Add acc too, in case the last number has no trailing line break.
    return total + acc;
}
This does seem to improve things a little: we're now at 19.0 seconds. We've taken another second off our personal best!
One of the proposed answers involves using multiple cores. I'm a little ashamed that that hadn't occurred to me!
The answer came in for some stick, because of the assumption that it's an I/O-bound problem. This seems a little harsh, in light of the results about I/O! Certainly worth a try, in any case.
We'll do this using fork/join. Here's a class to represent the result of a computation on part of the file, bearing in mind that there might be a partial result to the left (if we started half way through a number), and a partial result to the right (if the buffer finished half way through a number). The class also has a method for allowing us to glue two such results together, into a combined result for two adjacent sub-tasks.
private class SumTaskResult {
    long subtotal;
    int leftPartial;
    int leftMulCount;
    int rightPartial;
    public void append(SumTaskResult rightward) {
        subtotal += rightward.subtotal + rightPartial
                * rightward.leftMulCount + rightward.leftPartial;
        rightPartial = rightward.rightPartial;
    }
}
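To see the gluing on a concrete input, here is a minimal standalone sketch. The names `PartialSumDemo`, `Part`, `scan` and `sumSplitAt` are illustrative and not part of the original code; `Part` mirrors the fields of SumTaskResult, and `scan` computes for one chunk what a single-threaded pass over that chunk would produce. It assumes every chunk contains at least one non-digit byte (a line break), which is also what the gluing formula above relies on. The cast to long in `append` is an extra precaution added here.

```java
import java.nio.charset.StandardCharsets;

class PartialSumDemo {
    // Same fields as SumTaskResult; leftMulCount starts at 1 for an empty result.
    static final class Part {
        long subtotal;
        int leftPartial;
        int leftMulCount = 1;
        int rightPartial;

        void append(Part rightward) {
            // Join our trailing digits to the neighbour's leading digits.
            subtotal += rightward.subtotal
                    + (long) rightPartial * rightward.leftMulCount
                    + rightward.leftPartial;
            rightPartial = rightward.rightPartial;
        }
    }

    // Scans buf[start, end); assumes the range contains at least one line break.
    static Part scan(byte[] buf, int start, int end) {
        Part p = new Part();
        int pos = start;
        // Leading digits may be the tail of a number that began in the previous chunk.
        while (pos < end && buf[pos] >= '0' && buf[pos] <= '9') {
            p.leftPartial = p.leftPartial * 10 + (buf[pos] - '0');
            p.leftMulCount *= 10;
            pos++;
        }
        int acc = 0;
        for (int i = pos; i < end; i++) {
            if (buf[i] >= '0' && buf[i] <= '9') {
                acc = acc * 10 + (buf[i] - '0');
            } else {
                p.subtotal += acc;
                acc = 0;
            }
        }
        p.rightPartial = acc; // digits left over with no terminating line break yet
        return p;
    }

    // Sums the text by scanning two chunks independently and gluing the results.
    static long sumSplitAt(String text, int mid) {
        byte[] buf = text.getBytes(StandardCharsets.US_ASCII);
        Part total = new Part();
        total.append(scan(buf, 0, mid));
        total.append(scan(buf, mid, buf.length));
        return total.subtotal + total.rightPartial;
    }

    public static void main(String[] args) {
        // The split falls in the middle of 4567: "123\n45" and "67\n89\n".
        System.out.println(sumSplitAt("123\n4567\n89\n", 6));
    }
}
```

With the split after the sixth byte, the left chunk ends with rightPartial 45 and the right chunk starts with leftPartial 67 and leftMulCount 100, so the glue step contributes 45 * 100 + 67 = 4567, the number that straddled the boundary.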
Now the key bit: the RecursiveTask that computes the result. For small problems (less than 64 characters), it calls computeDirectly() to calculate the result in a single thread; for larger problems, it splits into two, solves the two sub-problems in separate threads, and then combines the results.
private class SumForkTask extends RecursiveTask<SumTaskResult> {
    private byte buf[];
    // startPos inclusive, endPos exclusive
    private int startPos;
    private int endPos;
    public SumForkTask(byte buf[], int startPos, int endPos) {
        this.buf = buf;
        this.startPos = startPos;
        this.endPos = endPos;
    }
    private SumTaskResult computeDirectly() {
        SumTaskResult result = new SumTaskResult();
        int pos = startPos;
        result.leftMulCount = 1;
        while ((buf[pos] >= 48) && (buf[pos] <= 57)) {
            result.leftPartial = result.leftPartial * 10 + buf[pos] - 48;
            result.leftMulCount *= 10;
            pos++;
        }
        int acc = 0;
        for (int i = pos; i < endPos; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                result.subtotal += acc;
                acc = 0;
            }
        result.rightPartial = acc;
        return result;
    }
    @Override
    protected SumTaskResult compute() {
        if (endPos - startPos < 64)
            return computeDirectly();
        int mid = (endPos + startPos) / 2;
        SumForkTask left = new SumForkTask(buf, startPos, mid);
        left.fork();
        SumForkTask right = new SumForkTask(buf, mid, endPos);
        SumTaskResult rRes = right.compute();
        SumTaskResult lRes = left.join();
        lRes.append(rRes);
        return lRes;
    }
}
Note that this is operating on a byte[], rather than the whole MappedByteBuffer. The reason for that is that we want to keep the disk access sequential. We'll take quite large chunks, fork/join, and then move to the next chunk.
Here's the method that does that. Note that we've pushed the buffer size up to 1MB (sub-optimal earlier, but more sensible here, it seems).
private long sumBinaryForwardMapForked() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    ForkJoinPool pool = new ForkJoinPool();
    byte buf[] = new byte[1 * 1024 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    SumTaskResult result = new SumTaskResult();
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        SumForkTask task = new SumForkTask(buf, 0, len);
        result.append(pool.invoke(task));
    }
    ch.close();
    raf.close();
    pool.shutdown();
    // Include any trailing number that has no final line break.
    return result.subtotal + result.rightPartial;
}
Now here's the soul-destroying disappointment: this nicely multi-threaded code now takes 32.2 seconds. Why so slow? I spent quite a while debugging this, assuming I'd done something terribly wrong.
Turns out there was just one small tweak needed. I'd thought the threshold of 64 between small problem and big problem was a reasonable one; turns out that was totally ridiculous.
Think about it like this. The sub-problems are exactly the same size, so they should complete in pretty much the same time. So there's really no point splitting into more pieces than there are processors available. On the machine I'm using, with only two cores, going down to a threshold of 64 is ridiculous: it just adds more overhead.
Now you don't want to limit things so that it only uses two cores even when there are more available. Perhaps the right thing to do would be to find out the number of processors at runtime, and split into that many pieces.
In any case, if I change the threshold to 512KB (half the buffer size), it now completes in 13.3 seconds. Going down to 128KB or 64KB would allow more cores to be used (up to 8 or 16 respectively), and doesn't significantly affect the runtime.
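The idea of deriving the threshold from the number of processors at runtime could be sketched like this. The class and method names are hypothetical. It assumes the same rule as the task above (compute directly when the range is smaller than the threshold, otherwise split in half), which is why the threshold comes out one byte larger than the intended leaf size. `countLeaves` simply simulates the halving to show how many leaf tasks a given threshold produces.

```java
class ThresholdDemo {
    // Rounds the worker count up to a power of two, because each task splits in half.
    static int piecesFor(int workers) {
        int w = Math.max(1, workers);
        int p = Integer.highestOneBit(w);
        return (p == w) ? p : p << 1;
    }

    // Threshold for a test of the form "if (size < threshold) computeDirectly()".
    static int thresholdFor(int bufferSize, int workers) {
        return bufferSize / piecesFor(workers) + 1;
    }

    // Counts the leaf tasks produced by repeated halving under that test.
    static int countLeaves(int size, int threshold) {
        if (size < threshold || size < 2) {
            return 1;
        }
        int half = size / 2;
        return countLeaves(half, threshold) + countLeaves(size - half, threshold);
    }

    public static void main(String[] args) {
        int buffer = 1024 * 1024;
        int cores = Runtime.getRuntime().availableProcessors();
        int threshold = thresholdFor(buffer, cores);
        System.out.println("cores=" + cores + " threshold=" + threshold
                + " leaves=" + countLeaves(buffer, threshold));
        System.out.println("leaves with a threshold of 64: " + countLeaves(buffer, 64));
    }
}
```

Under this rule a threshold of 64 turns each 1MB buffer into 32,768 leaf tasks of 32 bytes each, which gives a feel for where the overhead comes from.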
So multi-threading does make a big difference.
It's been quite a long journey, but we started out with something that took 92.9 seconds and we're now down to 13.3 seconds... that's seven times the speed of the original code. And that's not by improving the asymptotic (big-Oh) time complexity, which was linear (optimal) right from the start... this has all been about improving the constant factor.
A good day's work.
I suppose I should probably try using the GPU next...
I generated the random numbers with the following code, which I ran and redirected to a file. Obviously I can't guarantee that you'll end up with exactly the same random numbers that I had :)
public static void genRandoms() {
    Random r = new Random();
    for (int i = 0; i < 100000000; i++)
        System.out.println(r.nextInt(1000000000));
}
Old*_*eon 11
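Your main bottleneck will be file I/O. Parsing and adding up the numbers should not contribute much to the running time, because that work can be done in a separate thread while the file I/O is waiting for the disk.
Some years ago I researched how to read from files in the fastest possible manner and came across some excellent advice, which I implemented as a scan routine as below: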
// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];
// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    final FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        final long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining() && p.ok()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet && p.ok(); i++) {
                p.check(buffer[i]);
                //size += 1;
            }
        }
        red += read;
    } while (red < ch.size() && p.ok());
    // Finish off.
    p.close();
    ch.close();
    f.close();
}
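You may wish to adjust this technique before testing it for speed, because it uses an interface object called a Hunter to hunt for the data.
As you can see, the advice dates from 2008, and there have been many enhancements to Java since then, so this may not provide an improvement.
I have not tested this, but it should fit into your tests and use the same technique: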
class Summer {
    long sum = 0;
    long val = 0;
    public void add(byte b) {
        if (b >= '0' && b <= '9') {
            val = (val * 10) + (b - '0');
        } else {
            sum += val;
            val = 0;
        }
    }
    public long getSum() {
        return sum + val;
    }
}
private long sumMapped() throws IOException {
    Summer sum = new Summer();
    FileInputStream f = new FileInputStream(file);
    final FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        final long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet; i++) {
                sum.add(buffer[i]);
            }
        }
        red += read;
    } while (red < ch.size());
    // Finish off.
    ch.close();
    f.close();
    return sum.getSum();
}
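Why is this so much faster?
Creating a String is much more expensive than a little maths.
Can we do any better than this by using a MappedByteBuffer to help?
A little, yes. It's what I would use. It saves a memory-to-memory copy, i.e. no byte[] is needed.
I have a feeling that the overheads of invoking methods to read from the buffer would slow things down,
The methods get inlined if they are simple.
especially when reading backwards from the buffer.
It won't be any slower; in fact parsing forwards is simpler and faster because you use one * instead of two.
Would it be better to read the file forwards rather than backwards, but still scan the buffer backwards?
I don't understand why you would need to read backwards at all.
The idea would be that you read the first chunk of the file, and then scan backwards, but discarding the half-number at the end. Then when you read the next chunk, you set the offset so that you read from the beginning of the number you discarded.
Sounds unnecessarily complicated. I would read in a single pass, memory mapping the entire file in one go. There is no need to use chunks unless the file is more than 2 GB in size. Even then I would read in one pass.
Is there anything I haven't thought of that could make a significant difference?
If the data is in the disk cache, that will make more difference than anything else.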
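The single-pass approach suggested above, reading straight from the mapped buffer with no intermediate byte[], might look like the following sketch. The class name `MappedSumDemo` and its helpers are illustrative. The `sum` method takes any ByteBuffer (a MappedByteBuffer is one), so it can be tried on in-memory data as well as on a mapped file. The file branch assumes the file is under 2 GB, the limit for a single mapping.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class MappedSumDemo {
    // Sums the decimal integers in the buffer, scanning forwards one byte at a time.
    static long sum(ByteBuffer bb) {
        long total = 0;
        long acc = 0;
        while (bb.hasRemaining()) {
            byte b = bb.get();
            if (b >= '0' && b <= '9') {
                acc = acc * 10 + (b - '0');
            } else {
                total += acc;
                acc = 0;
            }
        }
        return total + acc; // picks up a final number with no trailing line break
    }

    // Convenience helper for trying the scan on a small in-memory string.
    static long sumText(String text) {
        return sum(ByteBuffer.wrap(text.getBytes(StandardCharsets.US_ASCII)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println(sumText("12\n345\n6"));
            return;
        }
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            // Assumption: the file is smaller than 2 GB, so one mapping covers it.
            System.out.println(sum(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())));
        }
    }
}
```

Whether the per-byte `get()` calls really are inlined, and so match the speed of copying into a byte[] first, is something to measure on the target JVM rather than assume.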
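I think there is another way of doing this.
This is a classic multi-process programming problem. In C there is the MPI library, which solves this kind of problem.
The idea is to split the list of integers into chunks, for example 4 parts, and have each part summed by a different process. When they have finished, the partial sums are added together.
In Java this could be done with threads (pseudo-parallel) and Java concurrency.
For example, 4 different threads sum 4 different parts of the list. At the end their results are added together.
Telephone companies use grid computers that apply this kind of parallel programming technique to sum up their transactions.
The only problem (bottleneck) here is the I/O operation. Reading the file will take a lot of time. You could somehow make multiple threads read different parts of the file, but that is a very complicated approach and I don't think it will do much good, because the disk won't spin any faster just because it is being used by many threads. There are other techniques for doing similar things, though. You can read more about this here: Access File through multiple threads, and here: Reading a single file with Multiple Thread: should speed up?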
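The chunk-and-sum idea described above can be sketched on an in-memory byte array. This is only an illustration under stated assumptions: the names are made up, a parallel stream stands in for explicitly managed threads, and the data is already in memory, so it says nothing about the I/O side. Chunk boundaries are moved forward to the start of a line so that no number is split between two workers.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

class ChunkedSumDemo {
    // Sums the numbers in buf[start, end); start must be the beginning of a line.
    static long sumRange(byte[] buf, int start, int end) {
        long total = 0;
        long acc = 0;
        for (int i = start; i < end; i++) {
            if (buf[i] >= '0' && buf[i] <= '9') {
                acc = acc * 10 + (buf[i] - '0');
            } else {
                total += acc;
                acc = 0;
            }
        }
        return total + acc; // picks up a final number with no trailing line break
    }

    // Moves pos forward until it sits just after a '\n' (or at either end of the data).
    static int alignToLineStart(byte[] buf, int pos) {
        if (pos == 0) {
            return 0;
        }
        while (pos < buf.length && buf[pos - 1] != '\n') {
            pos++;
        }
        return pos;
    }

    // Splits the data into `parts` line-aligned ranges and sums them in parallel.
    static long parallelSum(byte[] buf, int parts) {
        int[] bounds = new int[parts + 1];
        for (int k = 0; k <= parts; k++) {
            int raw = (int) ((long) buf.length * k / parts);
            bounds[k] = alignToLineStart(buf, raw);
        }
        return IntStream.range(0, parts).parallel()
                .mapToLong(k -> sumRange(buf, bounds[k], bounds[k + 1]))
                .sum();
    }

    static long parallelSum(String text, int parts) {
        return parallelSum(text.getBytes(StandardCharsets.US_ASCII), parts);
    }

    public static void main(String[] args) {
        System.out.println(parallelSum("123\n4567\n89\n", 4));
    }
}
```

Aligning the boundaries is a simpler alternative to carrying partial numbers between chunks, at the cost of slightly uneven chunk sizes.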