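I've built a solution that reads a large CSV file, currently 20-30 MB in size. I'm trying to remove duplicate rows based on the values of certain columns that the user selects at run time, using the usual technique for finding duplicate rows, but the program does not seem to work at all.
What other technique can I apply to remove duplicate records from a CSV file?
Here is the code. I'm definitely doing something wrong.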
DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> of the column names to compare
DataTable dt = RemoveDuplicateRecords(dtCSV, columns);
private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
{
    DataView dv = dtCSV.DefaultView;
    string RowFilter = string.Empty;
    if (dt == null)
        dt = dv.ToTable().Clone();
    DataRow row = dtCSV.Rows[0];
    foreach (DataRow row in dtCSV.Rows)
    {
        try
        {
            RowFilter = string.Empty;
            foreach (string column in columns)
            {
                string col = column;
                RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'", "''") + "' and ";
            }
            RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
            dv.RowFilter = RowFilter;
            DataRow dr = dt.NewRow();
            bool result = RowExists(dt, RowFilter);
            if (!result)
            {
                dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                dt.Rows.Add(dr);
            }
        }
        catch (Exception ex)
        {
        }
    }
    return dt;
}
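One approach is to walk the table and build a HashSet<string> containing the combined values of the columns you are interested in. If you try to add a string that is already in the set, you have found a duplicate row. Something like: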
HashSet<string> ScannedRecords = new HashSet<string>();
// Declare the loop variable as DataRow: DataTable.Rows is a non-generic
// collection, so "var" would type each row as object.
foreach (DataRow row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }
    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}
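That should be very fast: the table is scanned only once, and each HashSet<string>.Add call is a constant-time operation on average.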
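To show how the HashSet idea could produce the de-duplicated table the question asks for, here is a minimal sketch. The class name CsvDeduplicator and the choice to keep the first row of each group are assumptions made for illustration, not part of the original answer. It also assumes the unit-separator character (U+001F) never appears in the data, so that joined keys cannot collide.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Text;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class CsvDeduplicator
{
    // Returns a new table with the same schema as source that keeps only the
    // first row seen for each distinct combination of values in keyColumns.
    public static DataTable RemoveDuplicateRecords(DataTable source, IList<string> keyColumns)
    {
        DataTable result = source.Clone();      // copies the schema, no rows
        var seen = new HashSet<string>();
        var key = new StringBuilder();

        foreach (DataRow row in source.Rows)
        {
            key.Clear();
            foreach (string col in keyColumns)
            {
                // Assumption: U+001F does not occur in the data, so it can
                // safely separate the individual column values in the key.
                key.Append(row[col]).Append('\u001F');
            }

            // Add returns true only the first time a key is seen.
            if (seen.Add(key.ToString()))
            {
                result.ImportRow(row);
            }
        }

        return result;
    }
}
```

With the question's variables, the call would look like DataTable dt = CsvDeduplicator.RemoveDuplicateRecords(dtCSV, columns);. Unlike the RowFilter approach, this never re-filters the whole table for each row, so the work grows roughly linearly with the number of rows.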