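I'm building a CSV data parser in C++. I'm trying to access the first and fifteenth columns of the file and read them into two arrays using the getline command. For example: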
for(int j=0;j<i;j++)
{
    getline(posts2,postIDs[j],',');
    for(int k=0;k<14;k++)
    {
        getline(posts2,tossout,',');
    }
    getline(posts2,answerIDs[j],',');
    getline(posts2,tossout,'\r');
}
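However, between the first and fifteenth columns there is a quote-enclosed column that contains assorted commas and loose quotes. For example: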
...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...
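What is the best way to get past this column? I can't simply getline over it, because it has quotes and commas inside. After running into a quote, should I read the quoted garbage character by character until I find ", in sequence?
Also, I have seen other solutions, but they were all exclusive to Windows/Visual Studio. I'm running Mac OS X 10.8.3 with Xcode 3.2.3.
Thanks in advance! Drew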
There is no formal standard for the CSV format, but let us note at the outset that the ugly column you cite:
"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",
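does not comply with what are regarded as the basic rules of CSV, two of which are:
1) A field with embedded commas must be quoted.
2) Each embedded double-quote character must be represented by a pair of double-quote characters.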
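To make the two rules concrete, here is a minimal sketch of how a writer could emit a field so that it obeys both. The name `csv_quote` is hypothetical and not part of any library, and the choice to quote fields containing line breaks as well is an assumption.

```cpp
#include <string>

// Hypothetical helper: returns `field` in a form that obeys rules 1) and 2).
// Assumption: fields containing line breaks are quoted as well.
std::string csv_quote(const std::string& field)
{
    // A field with no comma, quote or line break is emitted unchanged.
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;
    }
    std::string out = "\"";               // rule 1): opening quote
    for (std::string::size_type i = 0; i < field.size(); ++i) {
        if (field[i] == '"') {
            out += '"';                   // rule 2): double each embedded quote
        }
        out += field[i];
    }
    out += '"';                           // rule 1): closing quote
    return out;
}
```

For instance, the text Super, "luxurious" truck comes out as "Super, ""luxurious"" truck", while a plain field such as E350 is left as it is.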
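If the problem column obeys rule 1), then it does not obey rule 2). But we can construe it so that it obeys rule 1), and so say where it ends, if we balance the double quotes, e.g.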
[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],
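The balanced outermost quotes enclose the column. The balanced internal quotes may simply lack any other indication of being internal, apart from the fact that balancing makes them internal.
We would like a rule that parses this text as one column, consistently with rule 1), and that also parses columns that do obey rule 2). The balancing just exhibited suggests that this can be done, because a column that obeys both rules is necessarily balanced as well.
The suggested rule is:
A column runs to the first comma that is either preceded by 0 double quotes or immediately preceded by the last of an even number of double quotes. If there is an even number of double quotes before the comma, then we know we can balance the enclosing quotes and balance the rest in at least one way.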
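The simpler rule that you are considering:
After running into a quote, should I read the quoted garbage character by character until I find ", in sequence?
will fail if it encounters certain columns that do obey rule 2), such as
"Super, ""luxurious"", truck",
The simpler rule will terminate the column after ""luxurious"",. But since this column complies with rule 2), the adjacent double quotes are "escaped" double quotes with no delimiting significance. The suggested rule, on the other hand, still parses the column correctly, terminating it after truck",.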
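The difference between the two rules can be checked on a string held in memory. The following is only a sketch: the names `end_by_simple_rule` and `end_by_suggested_rule` are hypothetical, and each returns the length of the first column, terminal comma included.

```cpp
#include <string>

// Simpler rule (hypothetical name): after an opening quote, stop at the
// first `",` pair. An unquoted column stops at the first comma.
// Returns std::string::npos if no terminator is found.
std::string::size_type end_by_simple_rule(const std::string& text)
{
    if (text.empty() || text[0] != '"') {
        std::string::size_type comma = text.find(',');
        return comma == std::string::npos ? comma : comma + 1;
    }
    std::string::size_type pos = text.find("\",", 1);
    return pos == std::string::npos ? pos : pos + 2;
}

// Suggested rule (hypothetical name): stop at the first comma that is
// preceded by 0 quotes, or immediately preceded by the last of an even
// number of quotes.
std::string::size_type end_by_suggested_rule(const std::string& text)
{
    unsigned quotes = 0;
    char prev = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char ch = text[i];
        if (ch == '"') {
            ++quotes;
        } else if (ch == ',' &&
                   (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            return i + 1;
        }
        prev = ch;
    }
    return std::string::npos;
}
```

Applied to the 30-character column "Super, ""luxurious"", truck", the simpler rule returns 22, which is just past ""luxurious"", whereas the suggested rule returns 30 and so consumes the whole column.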
Here is a demo program in which the function get_csv_column parses a column by the suggested rule:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>   // EOF
#include <cstdlib>

using namespace std;

/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
        - Have consumed a comma preceded by 0 quotes, or
        - Have consumed a comma immediately preceded by
          the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0;    // double quotes seen so far in this column
    char prev = 0;          // previously consumed character
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        col += prev = ch;   // the terminal comma is kept in the column
    }
    return col;
}

int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}
It wraps each column in <[...]>, does not discount newlines, and includes each column's terminal ','.
The file csv.txt is:
...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
The output is:
<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>