Best way to read a CSV file in C# to improve time efficiency

Question

Nex*_*eer 6 c# linq parallel-processing performance system.diagnostics

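I have the following code that reads a large file, say more than a million rows. I am using both a Parallel approach and a LINQ approach. Is there a better way to do this? If so, how?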

        private static void ReadFile()
        {
            float floatTester = 0;
            List<float[]> result = File.ReadLines(@"largedata.csv")
                .Where(l => !string.IsNullOrWhiteSpace(l))
                .Select(l => new { Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
                .Select(x => x.Fields
                              .Where(f => Single.TryParse(f, out floatTester))
                              .Select(f => floatTester).ToArray())
                .ToList();

            // now get your totals
            int numberOfLinesWithData = result.Count;
            int numberOfAllFloats = result.Sum(fa => fa.Length);
            MessageBox.Show(numberOfAllFloats.ToString());
        }

        private static readonly char[] Separators = { ',', ' ' };

        private static void ProcessFile()
        {
            var lines = File.ReadAllLines("largedata.csv");
            var numbers = ProcessRawNumbers(lines);

            var rowTotal = new List<double>();
            var totalElements = 0;

            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }

        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }

        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            Stopwatch stopWatchParallel = new Stopwatch();
            stopWatchParallel.Start();
            ProcessFile();
            stopWatchParallel.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts = stopWatchParallel.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);
            MessageBox.Show(elapsedTime);

            Stopwatch stopWatchLinQ = new Stopwatch();
            stopWatchLinQ.Start();
            ReadFile();
            stopWatchLinQ.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts2 = stopWatchLinQ.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTimeLinQ = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts2.Hours, ts2.Minutes, ts2.Seconds,
                ts2.Milliseconds / 10);
            MessageBox.Show(elapsedTimeLinQ);
        }

Answer 1

Vit*_*nko 5

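Recently I faced the problem of parsing large CSV files as fast as possible for the same purpose: data aggregation and metrics calculation (in my case, the final goal was generating a pivot table). I tested the most popular CSV readers and found that they are not designed for parsing CSV files with millions of rows or more. JoshClose's CsvHelper is fast, but in the end I was able to process CSV as a stream 2 to 4 times faster.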

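My approach is based on the following assumptions:

Avoid creating strings whenever possible, because they waste memory and CPU (that is, they increase the GC load). Instead, the parser result can be represented as a set of "field value" descriptors that hold only the start and end positions in the buffer plus some metadata (a quoted-value flag and the number of escaped double quotes inside the value). A string value is constructed only when it is actually needed.
Use a circular char[] buffer to read a CSV line, which avoids excessive data copying.
No abstractions and a minimal number of method calls. This enables effective JIT optimizations (for example, eliminating array bounds checks). No LINQ and no iterators (foreach), because a plain for loop is more efficient.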
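
The "field value" descriptor idea above can be sketched roughly as follows. This is a minimal illustration, not the answer author's actual implementation: `FieldSpan`, `CsvSketch`, and all method names are hypothetical, the line is assumed to already sit in a `char[]` (the circular-buffer refill logic is left out), and the span-based `double.TryParse` overload assumes .NET Core 2.1 or later.

```csharp
using System;
using System.Globalization;

// Hypothetical descriptor: where a field sits in the buffer, plus metadata.
public struct FieldSpan
{
    public int Start;         // index of the first character of the raw value
    public int Length;        // number of characters in the raw value
    public bool Quoted;       // the value was enclosed in double quotes
    public int EscapedQuotes; // number of "" pairs inside a quoted value
}

public static class CsvSketch
{
    // Scans buffer[start..end) as a single CSV line and fills `fields`.
    // Returns the number of fields found; no strings are allocated.
    public static int SplitLine(char[] buffer, int start, int end,
                                FieldSpan[] fields, char delimiter = ',')
    {
        int count = 0;
        int pos = start;
        while (pos <= end && count < fields.Length)
        {
            var f = new FieldSpan();
            if (pos < end && buffer[pos] == '"')
            {
                f.Quoted = true;
                f.Start = ++pos;
                for (; pos < end; pos++)
                {
                    if (buffer[pos] != '"') continue;
                    if (pos + 1 < end && buffer[pos + 1] == '"') { f.EscapedQuotes++; pos++; }
                    else break; // closing quote
                }
                f.Length = pos - f.Start;
                // Skip the closing quote and anything up to the next delimiter.
                while (pos < end && buffer[pos] != delimiter) pos++;
            }
            else
            {
                f.Start = pos;
                for (; pos < end; pos++)
                    if (buffer[pos] == delimiter) break;
                f.Length = pos - f.Start;
            }
            fields[count++] = f;
            if (pos >= end) break;
            pos++; // step over the delimiter
        }
        return count;
    }

    // Builds a string only when the caller really needs one.
    public static string GetString(char[] buffer, FieldSpan f)
    {
        var raw = new string(buffer, f.Start, f.Length);
        return f.EscapedQuotes > 0 ? raw.Replace("\"\"", "\"") : raw;
    }

    // Parses a number directly from the buffer without an intermediate string.
    public static bool TryGetDouble(char[] buffer, FieldSpan f, out double value)
    {
        return double.TryParse(new ReadOnlySpan<char>(buffer, f.Start, f.Length),
                               NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }
}
```

For an aggregation that touches only a few columns, the caller would reuse one `FieldSpan[]` across all lines and call `TryGetDouble` only for the columns it needs, so the remaining fields never become strings.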

Real-life usage numbers (a pivot table built from a 200 MB CSV file with 17 columns, only 3 of which are used to build the crosstab):

My custom CSV reader: ~1.9 s
CsvHelper: ~6.1 s

--- Update ---

I have published a library on GitHub that works as described above: https://github.com/nreco/csv

NuGet package: https://www.nuget.org/packages/NReco.Csv/

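@rburte Not yet. Are you interested in this library? The code is stable and works well in the production environment of my product (SeekTable), so I can publish it on GitHub / NuGet if someone else also needs this very fast, memory-efficient CSV parser. (2 upvotes)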

Answer 2

Rob*_*ben 3

You can use the built-in OleDb provider for this.

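using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.IO;
using System.Linq;

public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    // The text driver treats the directory as the data source and each file as a table.
    // Note: the Jet 4.0 provider is only available to 32-bit processes.
    string connectionString =
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" + file.DirectoryName + "\";" +
        "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';";

    using (OleDbConnection con = new OleDbConnection(connectionString))
    {
        using (OleDbCommand cmd = new OleDbCommand(string.Format
                                  ("SELECT * FROM [{0}]", file.Name), con))
        {
            con.Open();

            // Using a DataTable to process the data
            using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
            {
                DataTable tbl = new DataTable("MyTable");
                adp.Fill(tbl);

                //foreach (DataRow row in tbl.Rows)

                // Or directly make a list (AsEnumerable needs System.Data.DataSetExtensions)
                List<DataRow> list = tbl.AsEnumerable().ToList();
            }
        }
    }
}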

See this and this for further reference.

Asked: 13 years, 4 months ago
Viewed: 14269 times
Last active: 7 years ago