Fastest way to read a CSV file with a large number of columns and rows in C++

Question

use*_*538 5 c++ string optimization performance vector

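I have a pipe-delimited data file with more than 13 columns. The total file size is over 100 MB. I read each line and split it into a std::vector<std::string> so that I can do calculations on the fields. I repeat this for every line in the file, like this: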

    string filename = "file.dat";
    fstream infile(filename);
    string line;
    while (getline(infile, line)) {
        string item;
        stringstream ss(line);
        vector<string> splittedString;
        while (getline(ss, item, '|')) {
            splittedString.push_back(item);
        }
        int a = stoi(splittedString[0]); 
        // I do some processing like this before some manipulation and calculations with the data
    }

However, this is very time-consuming, and I am pretty sure it is not the most efficient way to read a CSV-type file. How can it be improved?

Update

I tried using the boost::split function instead of the while loop, but it is actually even slower.

Answer 1

rus*_*tyx 5

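You don't have a CSV file, because CSV stands for comma-separated values, and that is not what you have.
You have a delimited text file (apparently delimited by "|"). Parsing CSV is more complicated than simply splitting on ",".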
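
To illustrate why real CSV needs more than a plain split: a quoted field may itself contain the delimiter, and a doubled quote inside a quoted field stands for a literal quote. The following is only a minimal sketch of such a splitter. The name splitCsvLine is hypothetical (it is not part of the answer), and it assumes a record never spans more than one line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): splits one CSV record into fields,
// honoring double-quoted fields and "" as an escaped quote.
// Assumption: a record never spans more than one line.
std::vector<std::string> splitCsvLine(const std::string& line, char delim = ',') {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';    // "" inside quotes is a literal quote
                    ++i;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else {
                current += c;          // delimiters inside quotes are plain data
            }
        } else if (c == '"') {
            inQuotes = true;           // opening quote
        } else if (c == delim) {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);         // last field (possibly empty)
    return fields;
}
```

With this sketch, a record such as 1,"Smith, John" yields two fields, whereas a plain split on "," would yield three. For a pipe-delimited file without quoting, none of this machinery is needed.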

Anyway, without too many dramatic changes to your approach, here are a few suggestions:

Use (more) buffering.
Move the vector out of the loop and clear() it on each iteration. This saves heap reallocations.
Use string::find() instead of stringstream to split the string.

Something like this...

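    #include <fstream>
    #include <string>
    #include <vector>
    using namespace std;

    int main() {
        string filename = "file.dat";
        ifstream infile(filename);
        char buffer[65536];
        // The effect of pubsetbuf() is implementation-defined; call it before the first read.
        infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        string line;
        vector<string> splittedString;  // declared once; clear() keeps its capacity
        while (getline(infile, line)) {
            if (line.empty())
                continue;  // nothing to parse on a blank line
            splittedString.clear();
            size_t last = 0, pos = 0;
            while ((pos = line.find('|', last)) != std::string::npos) {
                splittedString.emplace_back(line, last, pos - last);
                last = pos + 1;
            }
            // Always add the final field, even when the line contains no '|' at all.
            splittedString.emplace_back(line, last);
            int a = stoi(splittedString[0]);
            // I do some processing like this before some manipulation and calculations with the data
        }
    }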
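
If C++17 is available, the same idea of avoiding allocations can be pushed one step further. This is an extension beyond what the answer proposes, sketched under stated assumptions: the fields are stored as std::string_view into the current line instead of being copied, and integers are parsed with std::from_chars instead of stoi. The helper names splitFields and parseInt are hypothetical.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: fills `out` with views into `line`, one per field.
// The views stay valid only while the underlying line buffer is unchanged.
void splitFields(std::string_view line, char delim, std::vector<std::string_view>& out) {
    out.clear();  // keeps the vector's capacity across calls
    std::size_t last = 0;
    std::size_t pos = 0;
    while ((pos = line.find(delim, last)) != std::string_view::npos) {
        out.push_back(line.substr(last, pos - last));
        last = pos + 1;
    }
    out.push_back(line.substr(last));  // final field, even without any delimiter
}

// Hypothetical helper: parses the whole field as an int; returns false on failure.
bool parseInt(std::string_view field, int& value) {
    const char* first = field.data();
    const char* last = first + field.size();
    const auto result = std::from_chars(first, last, value);
    return result.ec == std::errc() && result.ptr == last;
}
```

In the read loop, one would call splitFields(line, '|', fields) after each getline and finish with the fields before the next getline, because reading the next line invalidates the views. Whether this is measurably faster for a given file is something to benchmark rather than assume.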

Archived:	6 years, 6 months ago
Viewed:	1212 times
Last active:	6 years, 6 months ago