Java读取大文本文件具有7000万行文本

zwa*_*ang 37 java io

我有一个包含7000万行文本的大型测试文件.我必须逐行阅读文件.

我使用了两种不同的方法:

InputStreamReader isr = new InputStreamReader(new FileInputStream(FilePath),"unicode");
BufferedReader br = new BufferedReader(isr);
while((cur=br.readLine()) != null);
Run Code Online (Sandbox Code Playgroud)

LineIterator it = FileUtils.lineIterator(new File(FilePath), "unicode");
while(it.hasNext()) cur=it.nextLine();
Run Code Online (Sandbox Code Playgroud)

还有另一种方法可以使这项任务更快吗?

最好的祝福,

Evg*_*eev 42

1)我确信速度没有区别,都在内部使用FileInputStream和缓冲

2)您可以自己进行测量并查看

3)虽然没有性能优点我喜欢1.7方法

try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
    for (String line = null; (line = br.readLine()) != null;) {
        //
    }
}
Run Code Online (Sandbox Code Playgroud)

4)基于扫描仪的版本

    try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    }
Run Code Online (Sandbox Code Playgroud)

5)这可能比其他更快

try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
    ByteBuffer bb = ByteBuffer.allocateDirect(1000);
    for(;;) {
        StringBuilder line = new StringBuilder();
        int n = ch.read(bb);
        // add chars to line
        // ...
    }
}
Run Code Online (Sandbox Code Playgroud)

它需要一些编码,但由于ByteBuffer.allocateDirect,它可以真的更快.它允许操作系统直接从文件读取字节到ByteBuffer,无需复制

6)并行处理肯定会提高速度.创建一个大的字节缓冲区,运行几个任务,将文件中的字节读取到并行缓冲区中,当准备好找到第一行结束时,创建一个String,找到下一个...


Jay*_*han 8

如果你正在关注性能,你可以看看这些java.nio.*包 - 那些应该比它快java.io.*

  • 它们对于二进制文件来说更快,但对于文本来说,它们可能比有用的更麻烦. (8认同)

Jay*_*Jay 7

在 Java 8 中,对于现在想要逐行读取文件大文件的人来说,

Stream<String> lines = Files.lines(Paths.get("c:\myfile.txt"));
lines.forEach(l -> {
  // Do anything line by line   
});
Run Code Online (Sandbox Code Playgroud)


Par*_*ade 6

有一篇文章以不同的方式阅读文件.它将帮助您找到最佳解决方案.

文档: Java技巧:如何快速读取文件

  • Downvote:链接无效。这就是为什么`StackOverflow`不想链接答案的原因。 (3认同)

max*_*max 5

我有一个类似的问题,但是我只需要文件中的字节。我通读了各种答案中提供的链接,并最终尝试在Evgeniy的答案中写一个类似于#5的链接。他们不是在开玩笑,而是花了很多代码。

基本前提是每行文本的长度都是未知的。我将从SeekableByteChannel开始,将数据读入ByteBuffer,然后在其上循环以查找EOL。当某些内容在循环之间“传递”时,它会增加一个计数器,然后最终将SeekableByteChannel位置移动并读取整个缓冲区。

它很冗长...但是有效。它可以快速满足我的需求,但是我敢肯定,还有更多可以改进的地方。

处理方法被简化为开始读取文件的基础。

private long startOffset;
private long endOffset;
private SeekableByteChannel sbc;

private final ByteBuffer buffer = ByteBuffer.allocateDirect(1024);

public void process() throws IOException
{
    startOffset = 0;
    sbc = Files.newByteChannel(FILE, EnumSet.of(READ));
    byte[] message = null;
    while((message = readRecord()) != null)
    {
        // do something
    }
}

public byte[] readRecord() throws IOException
{
    endOffset = startOffset;

    boolean eol = false;
    boolean carryOver = false;
    byte[] record = null;

    while(!eol)
    {
        byte data;
        buffer.clear();
        final int bytesRead = sbc.read(buffer);

        if(bytesRead == -1)
        {
            return null;
        }

        buffer.flip();

        for(int i = 0; i < bytesRead && !eol; i++)
        {
            data = buffer.get();
            if(data == '\r' || data == '\n')
            {
                eol = true;
                endOffset += i;

                if(carryOver)
                {
                    final int messageSize = (int)(endOffset - startOffset);
                    sbc.position(startOffset);

                    final ByteBuffer tempBuffer = ByteBuffer.allocateDirect(messageSize);
                    sbc.read(tempBuffer);
                    tempBuffer.flip();

                    record = new byte[messageSize];
                    tempBuffer.get(record);
                }
                else
                {
                    record = new byte[i];

                    // Need to move the buffer position back since the get moved it forward
                    buffer.position(0);
                    buffer.get(record, 0, i);
                }

                // Skip past the newline characters
                if(isWindowsOS())
                {
                    startOffset = (endOffset + 2);
                }
                else
                {
                    startOffset = (endOffset + 1);
                }

                // Move the file position back
                sbc.position(startOffset);
            }
        }

        if(!eol && sbc.position() == sbc.size())
        {
            // We have hit the end of the file, just take all the bytes
            record = new byte[bytesRead];
            eol = true;
            buffer.position(0);
            buffer.get(record, 0, bytesRead);
        }
        else if(!eol)
        {
            // The EOL marker wasn't found, continue the loop
            carryOver = true;
            endOffset += bytesRead;
        }
    }

    // System.out.println(new String(record));
    return record;
}
Run Code Online (Sandbox Code Playgroud)


Har*_*ish 5

在此处输入图片说明实际上,我在空闲时间对这个主题进行了几个月的研究,并提出了一个基准测试,这里是一个代码,用于对逐行读取文件的所有不同方式进行基准测试。个人性能可能因底层系统而异。我在 Windows 10 Java 8 Intel i5 HP 笔记本电脑上运行:这是代码。

import java.io.*;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class ReadComplexDelimitedFile {
    private static long total = 0;
    private static final Pattern FIELD_DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");

    @SuppressWarnings("unused")
    private void readFileUsingScanner() {

        String s;
        try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
            while (stdin.hasNextLine()) {
                s = stdin.nextLine();
                String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                total = total + fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }

    //Winner
    private void readFileUsingCustomBufferedReader() {

        try (CustomBufferedReader stdin = new CustomBufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
            String s;
            while ((s = stdin.readLine()) != null) {
                String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                total += fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }


    private void readFileUsingBufferedReader() {

        try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
            String s;
            while ((s = stdin.readLine()) != null) {
                String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                total += fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }
    }

    private void readFileUsingLineReader() {

        try (LineNumberReader stdin = new LineNumberReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
            String s;
            while ((s = stdin.readLine()) != null) {
                String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                total += fields.length;
            }
        } catch (Exception e) {
            System.err.println("Error");
        }
    }

    private void readFileUsingStreams() {

        try (Stream<String> stream = Files.lines((new File(this.getClass().getResource("input.txt").getPath())).toPath())) {
            total += stream.mapToInt(s -> FIELD_DELIMITER_PATTERN.split(s, 0).length).sum();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }


    private void readFileUsingBufferedReaderFileChannel() {
        try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
            try (FileChannel inputChannel = fis.getChannel()) {
                try (CustomBufferedReader stdin = new CustomBufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
                    String s;
                    while ((s = stdin.readLine()) != null) {
                        String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                        total = total + fields.length;
                    }
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        } catch (Exception e) {
            System.err.println("Error");
        }

    }


    public static void main(String args[]) {
        //JVM wamrup
        for (int i = 0; i < 100000; i++) {
            total += i;
        }
        // We know scanner is slow-Still warming up
        ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
        List<Long> longList = new ArrayList<>(50);
        for (int i = 0; i < 50; i++) {
            total = 0;
            long startTime = System.nanoTime();
            //readComplexDelimitedFile.readFileUsingScanner();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingScanner");
        longList.forEach(System.out::println);
        // Actual performance test starts here

        longList = new ArrayList<>(10);
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingBufferedReader();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingBufferedReader");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingStreams();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingStreams");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingCustomBufferedReader();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingCustomBufferedReader");
        longList.forEach(System.out::println);
        longList.clear();
        for (int i = 0; i < 10; i++) {
            total = 0;
            long startTime = System.nanoTime();
            readComplexDelimitedFile.readFileUsingLineReader();
            long stopTime = System.nanoTime();
            long timeDifference = stopTime - startTime;
            longList.add(timeDifference);

        }
        System.out.println("Time taken for readFileUsingLineReader");
        longList.forEach(System.out::println);

    }
}
Run Code Online (Sandbox Code Playgroud)

我不得不重写 BufferedReader 以避免同步和一些不需要的边界条件。(至少这是我的感觉。它没有经过单元测试,所以使用它需要你自担风险。)

import com.sun.istack.internal.NotNull;

import java.io.*;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/**
 * Reads text from a character-input stream, buffering characters so as to
 * provide for the efficient reading of characters, arrays, and lines.
 * <p>
 * <p> The buffer size may be specified, or the default size may be used.  The
 * default is large enough for most purposes.
 * <p>
 * <p> In general, each read request made of a Reader causes a corresponding
 * read request to be made of the underlying character or byte stream.  It is
 * therefore advisable to wrap a CustomBufferedReader around any Reader whose read()
 * operations may be costly, such as FileReaders and InputStreamReaders.  For
 * example,
 * <p>
 * <pre>
 * CustomBufferedReader in
 *   = new CustomBufferedReader(new FileReader("foo.in"));
 * </pre>
 * <p>
 * will buffer the input from the specified file.  Without buffering, each
 * invocation of read() or readLine() could cause bytes to be read from the
 * file, converted into characters, and then returned, which can be very
 * inefficient.
 * <p>
 * <p> Programs that use DataInputStreams for textual input can be localized by
 * replacing each DataInputStream with an appropriate CustomBufferedReader.
 *
 * @author Mark Reinhold
 * @see FileReader
 * @see InputStreamReader
 * @see java.nio.file.Files#newBufferedReader
 * @since JDK1.1
 */

public class CustomBufferedReader extends Reader {

    private final Reader in;

    private char cb[];
    private int nChars, nextChar;

    private static final int INVALIDATED = -2;
    private static final int UNMARKED = -1;
    private int markedChar = UNMARKED;
    private int readAheadLimit = 0; /* Valid only when markedChar > 0 */

    /**
     * If the next character is a line feed, skip it
     */
    private boolean skipLF = false;

    /**
     * The skipLF flag when the mark was set
     */
    private boolean markedSkipLF = false;

    private static int defaultCharBufferSize = 8192;
    private static int defaultExpectedLineLength = 80;
    private ReadWriteLock rwlock;


    /**
     * Creates a buffering character-input stream that uses an input buffer of
     * the specified size.
     *
     * @param in A Reader
     * @param sz Input-buffer size
     * @throws IllegalArgumentException If {@code sz <= 0}
     */
    public CustomBufferedReader(@NotNull final Reader in, int sz) {
        super(in);
        if (sz <= 0)
            throw new IllegalArgumentException("Buffer size <= 0");
        this.in = in;
        cb = new char[sz];
        nextChar = nChars = 0;
        rwlock = new ReentrantReadWriteLock();
    }

    /**
     * Creates a buffering character-input stream that uses a default-sized
     * input buffer.
     *
     * @param in A Reader
     */
    public CustomBufferedReader(@NotNull final Reader in) {
        this(in, defaultCharBufferSize);
    }


    /**
     * Fills the input buffer, taking the mark into account if it is valid.
     */
    private void fill() throws IOException {
        int dst;
        if (markedChar <= UNMARKED) {
            /* No mark */
            dst = 0;
        } else {
            /* Marked */
            int delta = nextChar - markedChar;
            if (delta >= readAheadLimit) {
                /* Gone past read-ahead limit: Invalidate mark */
                markedChar = INVALIDATED;
                readAheadLimit = 0;
                dst = 0;
            } else {
                if (readAheadLimit <= cb.length) {
                    /* Shuffle in the current buffer */
                    System.arraycopy(cb, markedChar, cb, 0, delta);
                    markedChar = 0;
                    dst = delta;
                } else {
                    /* Reallocate buffer to accommodate read-ahead limit */
                    char ncb[] = new char[readAheadLimit];
                    System.arraycopy(cb, markedChar, ncb, 0, delta);
                    cb = ncb;
                    markedChar = 0;
                    dst = delta;
                }
                nextChar = nChars = delta;
            }
        }

        int n;
        do {
            n = in.read(cb, dst, cb.length - dst);
        } while (n == 0);
        if (n > 0) {
            nChars = dst + n;
            nextChar = dst;
        }
    }

    /**
     * Reads a single character.
     *
     * @return The character read, as an integer in the range
     * 0 to 65535 (<tt>0x00-0xffff</tt>), or -1 if the
     * end of the stream has been reached
     * @throws IOException If an I/O error occurs
     */
    public char readChar() throws IOException {
        for (; ; ) {
            if (nextChar >= nChars) {
                fill();
                if (nextChar >= nChars)
                    return (char) -1;
            }
            return cb[nextChar++];
        }
    }

    /**
     * Reads characters into a portion of an array, reading from the underlying
     * stream if necessary.
     */
    private int read1(char[] cbuf, int off, int len) throws IOException {
        if (nextChar >= nChars) {
            /* If the requested length is at least as large as the buffer, and
               if there is no mark/reset activity, and if line feeds are not
               being skipped, do not bother to copy the characters into the
               local buffer.  In this way buffered streams will cascade
               harmlessly. */
            if (len >= cb.length && markedChar <= UNMARKED && !skipLF) {
                return in.read(cbuf, off, len);
            }
            fill();
        }
        if (nextChar >= nChars) return -1;
        int n = Math.min(len, nChars - nextChar);
        System.arraycopy(cb, nextChar, cbuf, off, n);
        nextChar += n;
        return n;
    }

    /**
     * Reads characters into a portion of an array.
     * <p>
     * <p> This method implements the general contract of the corresponding
     * <code>{@link Reader#read(char[], int, int) read}</code> method of the
     * <code>{@link Reader}</code> class.  As an additional convenience, it
     * attempts to read as many characters as possible by repeatedly invoking
     * the <code>read</code> method of the underlying stream.  This iterated
     * <code>read</code> continues until one of the following conditions becomes
     * true: <ul>
     * <p>
     * <li> The specified number of characters have been read,
     * <p>
     * <li> The <code>read</code> method of the underlying stream returns
     * <code>-1</code>, indicating end-of-file, or
     * <p>
     * <li> The <code>ready</code> method of the underlying stream
     * returns <code>false</code>, indicating that further input requests
     * would block.
     * <p>
     * </ul> If the first <code>read</code> on the underlying stream returns
     * <code>-1</code> to indicate end-of-file then this method returns
     * <code>-1</code>.  Otherwise this method returns the number of characters
     * actually read.
     * <p>
     * <p> Subclasses of this class are encouraged, but not required, to
     * attempt to read as many characters as possible in the same fashion.
     * <p>
     * <p> Ordinarily this method takes characters from this stream's character
     * buffer, filling it from the underlying stream as necessary.  If,
     * however, the buffer is empty, the mark is not valid, and the requested
     * length is at least as large as the buffer, then this method will read
     * characters directly from the underlying stream into the given array.
     * Thus redundant <code>CustomBufferedReader</code>s will not copy data
     * unnecessarily.
     *
     * @param cbuf Destination buffer
     * @param off  Offset at which to start storing characters
     * @param len  Maximum number of characters to read
     * @return The number of characters read, or -1 if the end of the
     * stream has been reached
     * @throws IOException If an I/O error occurs
     */
    public int read(char cbuf[], int off, int len) throws IOException {
        int n = read1(cbuf, off, len);
        if (n <= 0) return n;
        while ((n < len) && in.ready()) {
            int n1 = read1(cbuf, off + n, len - n);
            if (n1 <= 0) break;
            n += n1;
        }
        return n;
    }

    /**
     * Reads a line of text.  A line is considered to be terminated by any one
     * of a line feed ('\n'), a carriage return ('\r'), or a carriage return
     * followed immediately by a linefeed.
     *
     * @param ignoreLF If true, the next '\n' will be skipped
     * @return A String containing the contents of the line, not including
     * any line-termination characters, or null if the end of the
     * stream has been reached
     * @throws IOException If an I/O error occurs
     * @see java.io.LineNumberReader#readLine()
     */
    String readLine(boolean ignoreLF) throws IOException {
        StringBuilder s = null;
        int startChar;



        bufferLoop:
        for (; ; ) {

            if (nextChar >= nChars)
                fill();
            if (nextChar >= nChars) { /* EOF */
                if (s != null && s.length() > 0)
                    return s.toString();
                else
                    return null;
            }
            boolean eol = false;
            char c = 0;
            int i;

            /* Skip a leftover '\n', if necessary */



            charLoop:
            for (i = nextChar; i < nChars; i++) {
                c = cb[i];
                if ((c == '\n')) {
                    eol = true;
                    break charLoop;
                }
            }

            startChar = nextChar;
            nextChar = i;

            if (eol) {
                String str;
                if (s == null) {
                    str = new String(cb, startChar, i - startChar);
                } else {
                    s.append(cb, startChar, i - startChar);
                    str = s.toString();
                }
                nextChar++;
                return str;
            }

            if (s == null)
                s = new StringBuilder(defaultExpectedLineLength);
            s.append(cb, startChar, i - startChar);
        }
    }

    /**
     * Reads a line of text.  A line is considered to be terminated by any one
     * of a line feed ('\n'), a carriage return ('\r'), or a carriage return
     * followed immediately by a linefeed.
     *
     * @return A String containing the contents of the line, not including
     * any line-termination characters, or null if the end of the
     * stream has been reached
     * @throws IOException If an I/O error occurs
     * @see java.nio.file.Files#readAllLines
     */
    public String readLine() throws IOException {
        return readLine(false);
    }

    /**
     * Skips characters.
     *
     * @param n The number of characters to skip
     * @return The number of characters actually skipped
     * @throws IllegalArgumentException If <code>n</code> is negative.
     * @throws IOException              If an I/O error occurs
     */
    public long skip(long n) throws IOException {
        if (n < 0L) {
            throw new IllegalArgumentException("skip value is negative");
        }
        rwlock.readLock().lock();
            long r = n;
            try{
            while (r > 0) {
                if (nextChar >= nChars)
                    fill();
                if (nextChar >= nChars) /* EOF */
                    break;
                if (skipLF) {
                    skipLF = false;
                    if (cb[nextChar] == '\n') {
                        nextChar++;
                    }
                }
                long d = nChars - nextChar;
                if (r <= d) {
                    nextChar += r;
                    r = 0;
                    break;
                } else {
                    r -= d;
                    nextChar = nChars;
                }
            }
        } finally {
            rwlock.readLock().unlock();
        }
        return n - r;

  • 我觉得这是一个关于哪个更快的彻底答案,因为它提供了所有这些方法之间的硬测量。但你必须真正阅读代码才能开始理解结果。只要多一点演示,结果就会更容易理解。但基本上对于编写非同步BufferedReader来说有5%左右的提升。我发现,只需使用默认的 LineNumberReader/BufferedReader 将缓冲区大小从 8K 增加到 64k,即可获得 40% 的改进。 (2认同)