What is the fastest, most efficient, and most elegant way to parse strings into dynamic types?

Question


aka*_*xer 11 c# string performance typeconverter

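I'm looking for the fastest generic way to convert strings to various data types on the fly.

I am parsing large text data files (several megabytes each) that are generated by another system. This particular function reads the lines of the text file, splits each line into columns on a delimiter, and places the parsed values into a .NET DataTable, which is later inserted into a database. The bottleneck, BY FAR, is the string conversion (Convert and TypeConverter).

I have to go the dynamic way (i.e. stay away from "Convert.ToInt32" and the like) because I never know which types will appear in the files. The types are determined by earlier configuration at runtime.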

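So far I have tried the following, and both take several minutes to parse a file. Note that if I comment out this one line, the whole thing runs in only a few hundred milliseconds.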

row[i] = Convert.ChangeType(columnString, dataType);


and

TypeConverter typeConverter = TypeDescriptor.GetConverter(type);
row[i] = typeConverter.ConvertFromString(null, cultureInfo, columnString);


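If anyone knows a faster way that is just as generic, I would like to hear about it. Or, if my whole approach is flawed for some reason, I'm open to suggestions. But please don't point me to non-generic approaches that use hard-coded types; that is simply not an option here.

Update: multi-threading to improve performance (test)

To improve performance, I looked into splitting the parsing work across multiple threads. I found that the speed increased somewhat, but still not as much as I had hoped. For those who are interested, here are my results.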

System:

Intel Xeon 3.3 GHz quad-core E3-1245

Memory: 12.0 GB

Windows 7 Enterprise x64

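Test:

The test function works like this:

(1) Receive an array of strings. (2) Split a string on the delimiter. (3) Parse the pieces into data types and store them in a row. (4) Add the row to the data table. (5) Repeat (2) - (4) until finished.

The test consisted of 1000 strings, each parsed into 16 columns, so 16000 string conversions in total. I tested a single thread, 4 threads (because of the quad core), and 8 threads (because of hyper-threading). Since I'm only crunching data here, I doubt that adding more threads than that would do any good. So with a single thread it parses 1000 strings, with 4 threads each thread parses 250 strings, and with 8 threads each thread parses 125 strings. I also tested a few different ways of using threads: thread creation, the thread pool, tasks, and function objects.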
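The partitioning described above could look roughly like the following. This is only a sketch, not the code used in the test: `ChunkedWork`, `Run`, and `processLine` are hypothetical names, and it assumes that `processLine` is safe to call from several threads at once (a DataTable is not safe for concurrent writes, so each thread would need its own table or a lock).

```csharp
using System;
using System.Threading;

public static class ChunkedWork
{
    // Hypothetical helper: splits `lines` into `threadCount` contiguous chunks and
    // processes each chunk on its own thread started via ParameterizedThreadStart.
    public static void Run(string[] lines, int threadCount, Action<string> processLine)
    {
        var threads = new Thread[threadCount];
        // Round up so that every line lands in some chunk.
        int chunkSize = (lines.Length + threadCount - 1) / threadCount;

        for (int t = 0; t < threadCount; t++)
        {
            int start = Math.Min(t * chunkSize, lines.Length);
            int end = Math.Min(start + chunkSize, lines.Length);

            threads[t] = new Thread(new ParameterizedThreadStart(state =>
            {
                // The chunk boundaries arrive as the thread's start parameter.
                var range = (int[])state;
                for (int i = range[0]; i < range[1]; i++)
                    processLine(lines[i]);
            }));
            threads[t].Start(new[] { start, end });
        }

        // Wait for all chunks to finish before returning.
        foreach (Thread thread in threads)
            thread.Join();
    }
}
```

With 1000 lines and 8 threads this hands 125 lines to each thread, matching the split described in the test.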

Results: all times are in milliseconds.

Single thread:

Method call: 17720

4 threads

Parameterized Thread Start: 13836
ThreadPool.QueueUserWorkItem: 14075
Task.Factory.StartNew: 16798
Func BeginInvoke EndInvoke: 16733

8 threads

Parameterized Thread Start: 12591
ThreadPool.QueueUserWorkItem: 13832
Task.Factory.StartNew: 15877
Func BeginInvoke EndInvoke: 16395

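As you can see, the fastest is Parameterized Thread Start with 8 threads (the number of my logical cores). However, it does not beat 4 threads by much, and it is only about 29% faster than a single thread. Of course, results will vary by machine. I also stuck with a

    Dictionary<Type, TypeConverter>

cache for string parsing, because using an array of type converters did not give a noticeable performance gain, and one shared cache of type converters is easier to maintain than creating arrays all over the place whenever I need them.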
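A shared converter cache of this kind might be sketched as follows. This is not the poster's actual class: `ConverterCache` and `Get` are hypothetical names, and a `ConcurrentDictionary` is assumed here instead of a plain `Dictionary` so that the worker threads can share it without extra locking.

```csharp
using System;
using System.Collections.Concurrent;
using System.ComponentModel;

public static class ConverterCache
{
    // One shared, thread-safe cache of converters keyed by target type.
    private static readonly ConcurrentDictionary<Type, TypeConverter> Cache =
        new ConcurrentDictionary<Type, TypeConverter>();

    public static TypeConverter Get(Type type)
    {
        // Returns the cached converter if present; otherwise looks it up through
        // TypeDescriptor and stores it. Under contention the lookup may run more
        // than once, but only one converter instance is kept per type.
        return Cache.GetOrAdd(type, t => TypeDescriptor.GetConverter(t));
    }
}
```

A call such as `ConverterCache.Get(typeof(int)).ConvertFromString(null, cultureInfo, columnString)` would then replace the repeated `TypeDescriptor.GetConverter` lookups.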

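Another update:

OK, I ran some more tests to see whether I could squeeze out more performance, and I found something interesting. I decided to stick with 8 threads, all started with the parameterized Thread.Start method (the fastest in my previous tests). I ran the same test as above with different parsing algorithms. I noticed that

    Convert.ChangeType and TypeConverter

take about the same time. Type-specific converters such as

    int.TryParse

are slightly faster, but they are not an option for me because my types are dynamic. ricovox had some good advice about exception handling. My data does contain invalid values: some integer columns use '-' for empty numbers, so the type converters blow up. That means every line I parse throws at least one exception, which makes 1000 exceptions! Very time-consuming.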

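By the way, this is how I do the conversion with TypeConverter. Extensions is just a static class, and GetTypeConverter simply returns a cached TypeConverter. If an exception is thrown during conversion, the default value is used.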

public static Object ConvertTo(this String arg, CultureInfo cultureInfo, Type type, Object defaultValue)
{
  Object value;
  TypeConverter typeConverter = Extensions.GetTypeConverter(type);

  try
  {
    // Try converting the string.
    value = typeConverter.ConvertFromString(null, cultureInfo, arg);
  }
  catch
  {
    // If the conversion fails then use the default value.
    value = defaultValue;
  }

  return value;
}


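Results:

Same test on 8 threads: parse 1000 lines, 16 columns each, 250 lines per thread.

So I tried 3 new things.

1 - Test run: check for known invalid values before parsing to minimize exceptions, e.g. if (!Char.IsDigit(c)) value = 0; or columnString.Contains('-'), etc.

Run time: 29 ms

2 - Test run: use custom parsing algorithms that have try/catch blocks.

Run time: 12424 ms

3 - Test run: use custom parsing algorithms that check for invalid values before parsing to minimize exceptions.

Run time: 15 ms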

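Wow! As you can see, eliminating the exceptions made a world of difference. I never realized how expensive exceptions are! If I limit exceptions to the TRULY unknown cases, the parsing runs roughly three orders of magnitude faster. I'm considering this completely solved. I think I will keep the dynamic type conversion with TypeConverter, since it is only a few milliseconds slower. Checking for known invalid values before converting avoids the exceptions and speeds things up enormously! Thanks to ricovox for pointing this out, which led me to test it further.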
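A pre-check along these lines might look like the sketch below. It is an illustration only: `SafeConvert` and `ConvertOrDefault` are hypothetical names, and the set of "known invalid" inputs (empty, whitespace, or a lone "-") is an assumption based on the data described above. String columns are passed through untouched, on the assumption that those placeholders are legitimate text there.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;

public static class SafeConvert
{
    // Returns defaultValue for inputs that are known to be invalid, without ever
    // calling the converter, so no exception is thrown for those cases.
    public static object ConvertOrDefault(string text, Type type, CultureInfo culture, object defaultValue)
    {
        // Assumption: for string columns every input is valid as-is.
        if (type == typeof(string))
            return text;

        // Known invalid values: null, empty, whitespace, or the "-" placeholder.
        if (string.IsNullOrWhiteSpace(text) || text.Trim() == "-")
            return defaultValue;

        TypeConverter converter = TypeDescriptor.GetConverter(type);
        try
        {
            return converter.ConvertFromString(null, culture, text);
        }
        catch (Exception)
        {
            // Only truly unexpected input still pays the cost of an exception.
            return defaultValue;
        }
    }
}
```

In a hot loop the `TypeDescriptor.GetConverter` call would normally be replaced by a cached lookup; it is kept inline here only to keep the sketch self-contained.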

Answer 1

drw*_*ode 3

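If you are mainly converting strings to the native data types (string, int, bool, DateTime, etc.), you could use something like the code below. It caches the TypeCodes and the TypeConverters (for non-native types) and uses a fast switch statement to jump straight to the appropriate parsing routine. This should save some time over Convert.ChangeType, because the source type (string) is already known and you can call the right parse method directly.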

/* Get an array of Types for each of your columns.
 * Open the data file for reading.
 * Create your DataTable and add the columns.
 * (You have already done all of these in your earlier processing.)
 * 
 * Note:    For the sake of generality, I've used an IEnumerable<string> 
 * to represent the lines in the file, although for large files,
 * you would use a FileStream or TextReader etc.
*/      
IList<Type> columnTypes;        //array or list of the Type to use for each column
IEnumerable<string> fileLines;  //the lines to parse from the file.
DataTable table;                //the table you'll add the rows to

int colCount = columnTypes.Count;
var typeCodes = new TypeCode[colCount];
var converters = new TypeConverter[colCount];
//Fill up the typeCodes array with the Type.GetTypeCode() of each column type.
//Also get a TypeConverter for every column: the switch below falls back to it
//for TypeCode.Object and for any TypeCode that has no explicit case.
for(int i = 0; i < colCount; i++) {
    typeCodes[i] = Type.GetTypeCode(columnTypes[i]);
    converters[i] = TypeDescriptor.GetConverter(columnTypes[i]);
}

//Probably faster to build up an array of objects and insert them into the row all at once.
object[] vals = new object[colCount];
object val;
foreach(string line in fileLines) {
    //delineate the line into columns, however you see fit. I'll assume a tab character.
    var columns = line.Split('\t');
    for(int i = 0; i < colCount; i++) {
        switch(typeCodes[i]) {
            case TypeCode.String:
                val = columns[i]; break;
            case TypeCode.Int32:
                val = int.Parse(columns[i]); break;
            case TypeCode.DateTime:
                val = DateTime.Parse(columns[i]); break;
            //...list types that you expect to encounter often.

            //finally, deal with other objects
            case TypeCode.Object:
            default:
                val = converters[i].ConvertFromString(columns[i]);
                break;
        }
        vals[i] = val;
    }
    //Add all values to the row at one time. 
    //This might be faster than adding each column one at a time.
    //There are two ways to do this:
    var row = table.Rows.Add(vals); //create new row on the fly.
    // OR, for a row that was created previously (existingRow is a placeholder name):
    // existingRow.ItemArray = vals;
}


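There really isn't any other way that would be faster, because we are basically just using the raw string parsing methods defined by the types themselves. You could write your own parsing code for each output type and optimize it for the exact formats you will encounter, but I assume that would be overkill for your project. It would probably be better and quicker to simply tailor the FormatProvider or NumberStyles in each case.

For example, say that whenever you parse Double values you know, based on your proprietary file format, that you will never encounter strings containing exponents, and that there will be no leading or trailing whitespace. You can pass that knowledge to the parser with the NumberStyles argument, as follows: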

//NOTE:   using System.Globalization;
var styles = NumberStyles.AllowDecimalPoint | NumberStyles.AllowLeadingSign;
var d = double.Parse(text, styles);


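I don't know how the parsing is implemented, but I would think that the NumberStyles argument lets the parsing routine work faster by ruling out various formatting possibilities. Of course, if you can't make any assumptions about the format of the data, you won't be able to make these kinds of optimizations.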
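The same restricted styles can be combined with `double.TryParse`, which reports failure through its return value instead of throwing. The sketch below assumes the file always uses "." as the decimal separator (hence `CultureInfo.InvariantCulture`); `StrictDouble` is a hypothetical name, and whether the narrower styles actually speed up parsing would have to be measured.

```csharp
using System;
using System.Globalization;

public static class StrictDouble
{
    // Only an optional leading sign and a decimal point are accepted:
    // no whitespace, no exponents, no thousands separators.
    private const NumberStyles Styles =
        NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint;

    // Returns false (and sets value to 0) instead of throwing on invalid input.
    public static bool TryParse(string text, out double value)
    {
        return double.TryParse(text, Styles, CultureInfo.InvariantCulture, out value);
    }
}
```

Inputs such as " 12.5", "1e3", or a lone "-" are rejected with a `false` return value, so placeholder values never cost an exception.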

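Of course, there is always the possibility that your code is slow simply because parsing a string into some data type takes time. Use a performance profiler (such as the one in VS2010) to see where the actual bottleneck is. Then you will be able to optimize better, or simply give up, e.g. if there is nothing left to do short of writing the parsing routines in assembly :-)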
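When no profiler is at hand, a rough first measurement can be taken with `System.Diagnostics.Stopwatch`. The helper below is a minimal sketch with hypothetical names (`Timing`, `MeasureMilliseconds`); it times the whole loop, not individual calls, and it is not a substitute for a real profiler.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action the given number of times and returns the total
    // elapsed wall-clock time in milliseconds.
    public static long MeasureMilliseconds(Action action, int iterations)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

For example, `Timing.MeasureMilliseconds(() => Convert.ChangeType("123", typeof(int)), 16000)` times 16000 conversions, the same number as in the question's test, and can be compared against the same loop using `int.Parse`.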

Asked:	13 years ago
Viewed:	4219 times
Last active:	12 years, 11 months ago