Simplest way to read a CSV file mapped to memory?

Question

use*_*112 3 c++ csv io boost memory-mapped-files

When I read files from C++ (C++11), I map them into memory using:

boost::interprocess::file_mapping* fm = new boost::interprocess::file_mapping(path, boost::interprocess::read_only);
boost::interprocess::mapped_region* region = new boost::interprocess::mapped_region(*fm, boost::interprocess::read_only);
char* bytes = static_cast<char*>(region->get_address());

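This is fine when I want to read byte by byte, extremely fast. However, I have created a CSV file that I would like to map to memory, read line by line, and split each line on the commas.

Is there a way to do this with a few modifications to the code above?

(I am mapping to memory because I have plenty of memory and I do not want any bottleneck from disk/IO streaming.)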

Answer 1

seh*_*ehe 5

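Here is my take on "fast enough". It runs through 116 MiB of CSV (2.5 million lines^[1]) in about 1 second.

The result is then randomly accessible at zero copy, so there is no overhead (unless pages have been swapped out).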

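For comparison:
That is about 3x faster than a naive wc csv.txt takes on the same file.
It is about as fast as the following perl one-liner (which lists the distinct field counts across all lines):
perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields  }' csv.txt
It is only slower (by about 1.5x) than (LANG=C wc csv.txt), which avoids the locale functionality.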

Here is the parser in all its glory:

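#include <boost/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/utility/string_ref.hpp>
#include <vector>

namespace qi = boost::spirit::qi;

using CsvField = boost::string_ref;
using CsvLine  = std::vector<CsvField>;
using CsvFile  = std::vector<CsvLine>;  // keep it simple :)

struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using namespace qi;
        using boost::phoenix::construct;
        using boost::phoenix::size;
        using boost::phoenix::begin;

        // raw[] exposes the matched range; the action turns it into a (pointer, length) view
        field = raw [*~char_(",\r\n")]
            [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
        line  = field % ',';
        lines = line  % eol;
    }
  private:
    qi::rule<char const*, CsvField()> field;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvFile()>  lines;
};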

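The only tricky part (and the only optimization there) is the semantic action, which constructs a CsvField from the source iterator and the number of matched characters.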

Here is the main function:

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main()
{
    boost::iostreams::mapped_file_source csv("csv.txt");

    CsvFile parsed;
    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
    }
}

Prints

116 MiB parsed into 2578421 lines of CSV field values

You can use the values much like std::string:

for (int i = 0; i < 10; ++i)
{
    auto l     = rand() % parsed.size();
    auto& line = parsed[l];
    auto c     = rand() % line.size();

    std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
}

Which prints, for example:

Random field at L:1979500    C:2    sateen's
Random field at L:928192     C:1    sackcloth's
Random field at L:1570275    C:4    accompanist's
Random field at L:479916     C:2    apparel's
Random field at L:767709     C:0    pinks
Random field at L:1174430    C:4    axioms
Random field at L:1209371    C:4    wants
Random field at L:2183367    C:1    Klondikes
Random field at L:2142220    C:1    Anthony
Random field at L:1680066    C:2    pines

The full working example is here: Live On Coliru

^[1] I created the file by repeatedly appending the output of

while read a && read b && read c && read d && read e
do echo "$a,$b,$c,$d,$e"
done < /etc/dictionaries-common/words

to csv.txt, until it reached 2.5 million lines.

Archived:	11 years, 6 months ago
Views:	1914
Last activity:	11 years, 6 months ago