从大型csv文件C#.Net中删除重复记录

San*_*rst 3 .net c# csv

我已经创建了一个解决方案,它读取当前大小为20-30 mb的大型csv文件,我试图根据用户在运行时选择的某些列值删除重复的行,使用通常的查找重复行的技术但是看起来程序似乎根本不起作用.

可以应用什么其他技术从csv文件中删除重复记录

这是代码,绝对是我做错了

DataTable dtCSV = ReadCsv(file, columns);
//columns is a list of string List column
DataTable dt=RemoveDuplicateRecords(dtCSV, columns);

private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
        {
            DataView dv = dtCSV.DefaultView;
            string RowFilter=string.Empty;

            if(dt==null)
            dt = dv.ToTable().Clone();

            DataRow row = dtCSV.Rows[0];
            foreach (DataRow row in dtCSV.Rows)
            {
                try
                {
                    RowFilter = string.Empty;

                    foreach (string column in columns)
                    {
                        string col = column;
                        RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'","''") + "' and ";
                    }
                    RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
                    dv.RowFilter = RowFilter;
                    DataRow dr = dt.NewRow();
                    bool result = RowExists(dt, RowFilter);
                    if (!result)
                    {
                        dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                        dt.Rows.Add(dr);

                    }

                }
                catch (Exception ex)
                {
                }
            }
            return dt;
        }

Jim*_*hel 7

一种方法是遍历表,构建一个HashSet<string>包含您感兴趣的组合列值.如果您尝试添加已经存在的字符串,那么您将有一个重复的行.就像是:

HashSet<string> ScannedRecords = new HashSet<string>();

foreach (var row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values 
    if (!ScannedRecords.Add(sb.ToString())
    {
        // This record is a duplicate.
    }
}
Run Code Online (Sandbox Code Playgroud)

那应该非常快.