A good and efficient CSV/TSV reader for Java

Rob*_*bin 11 java csv large-files opencsv

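I am trying to read large CSV and TSV (tab-separated) files with about 1,000,000 rows or more. I tried to read a TSV containing ~2,500,000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250,000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?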

For everybody who is interested in my code (I shortened it, so the try-catch is obviously invalid):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

Edit: This is the method where I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }

            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }

Jer*_*kes 13

Do not use a CSV parser to parse TSV input. It will break, for example, if the TSV has fields that contain a quote character.
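A minimal sketch of the quote problem, using only the JDK rather than any parser library: in TSV a double quote is an ordinary character, so splitting on tabs keeps it as data, whereas a CSV parser configured with `"` as its quote character would treat it as the start of a quoted field. The class and method names here are made up for illustration, and the sketch ignores TSV escape sequences.

```java
import java.util.Arrays;

final class TsvQuoteSketch {
    // Splits one TSV line on tabs; the -1 limit keeps trailing empty fields.
    static String[] splitTsvLine(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // The second field holds an unbalanced double quote, which is legal in TSV.
        String line = "42\t19\" monitor\tin stock";
        // prints [42, 19" monitor, in stock]
        System.out.println(Arrays.toString(splitTsvLine(line)));
    }
}
```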

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example of parsing a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is too big to be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));

Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).


Run*_*ion 6

I have not tried it, but I looked into superCSV earlier.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check whether it works for your 2.5 million lines.

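  • @Robin As a Super CSV developer I'm glad to hear that, though to be fair to opencsv, you are bound to run into (memory) problems if you use `reader.readAll()` instead of reading each line and doing something with it. Your `replaceBackslashes()` method could also run into problems, because it writes the whole file into memory. Does your NPE occur when closing one of your streams/readers? (3 upvotes)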
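The points in the comment above can be sketched with the JDK alone. This is an assumption-laden illustration, not opencsv or Super CSV code: `TsvStreamSketch` and `forEachRow` are hypothetical names, fields are split on plain tabs with no escape handling, and the per-line cleanup simply mirrors what `replaceBackSlashes()` does in the question. It reads one line at a time instead of calling `readAll()`, and uses try-with-resources so `close()` is never invoked on a null reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

final class TsvStreamSketch {
    // Reads the input line by line and hands each row to the callback,
    // so only the current line is held in memory by this method.
    static long forEachRow(Reader source, Consumer<String[]> rowHandler) throws IOException {
        long count = 0;
        // try-with-resources closes the reader, even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Same cleanup as replaceBackSlashes(), applied per line
                // instead of buffering the whole file first.
                String cleaned = line.replace("\\", "/").replace("'", "");
                rowHandler.accept(cleaned.split("\t", -1));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Demo input; for a real file, pass a reader opened with the file's encoding.
        Reader demo = new StringReader("id\tpath\n1\tC:\\data\\'a'.tsv\n");
        long rows = forEachRow(demo, row -> System.out.println(String.join(" | ", row)));
        System.out.println(rows + " rows");
    }
}
```

Memory use then depends on what the callback keeps, not on the file size. Collecting every row into a list would bring back the same problem as `readAll()`.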