LINQ method to add items to a dictionary

Mat*_*ren 9 .net linq performance foreach

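I'm trying to learn a bit more about LINQ by implementing Peter Norvig's spelling corrector in C#.

The first part involves taking a large file of words (about 1 million) and putting them into a dictionary, where the key is the word and the value is the number of occurrences.

I would normally do it like this: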

foreach (var word in allWords)                                                    
{           
    if (wordCount.ContainsKey(word))
        wordCount[word]++;
    else
        wordCount.Add(word, 1);
}

Where allWords is an IEnumerable<string>.

In LINQ I'm currently doing it like this:

var wordCountLINQ = (from word in allWordsLINQ
                         group word by word
                         into groups
                         select groups).ToDictionary(g => g.Key, g => g.Count());  

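I compared the two dictionaries by looking at all of their <key, value> pairs, and they are identical, so both approaches produce the same results.

The foreach loop takes 3.82 secs and the LINQ query takes 4.49 secs.

I'm timing it with the Stopwatch class and I'm running in RELEASE mode. I don't think the performance is bad; I'm just wondering whether there is a reason for the difference.

Am I writing the LINQ query in an inefficient way, or am I missing something?

Update: here is the full benchmark code sample: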

public static void TestCode()
{
    //File can be downloaded from http://norvig.com/big.txt and consists of about a million words.
    const string fileName = @"path_to_file";
    var allWords = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                   select m.Value;

    var wordCount = new Dictionary<string, int>();
    var timer = new Stopwatch();            
    timer.Start();
    foreach (var word in allWords)                                                    
    {           
        if (wordCount.ContainsKey(word))
            wordCount[word]++;
        else
            wordCount.Add(word, 1);
    }
    timer.Stop();

    Console.WriteLine("foreach loop took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);

    //Make LINQ use a different Enumerable (with exactly the same values);
    //if you don't, it suddenly becomes way faster, which I assume is a caching thing??
    var allWordsLINQ = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                   select m.Value;

    timer.Reset();
    timer.Start();
    var wordCountLINQ = (from word in allWordsLINQ
                            group word by word
                            into groups
                            select groups).ToDictionary(g => g.Key, g => g.Count());  
    timer.Stop();

    Console.WriteLine("LINQ took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);                     
}

Rub*_*ben 6

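One of the reasons the LINQ version is slower is that two dictionaries are created instead of one:

  1. (Internally) by the group by operator; the group by also stores each individual word. You can verify this by calling ToArray() on a group rather than Count(). This is a lot of overhead that you don't actually need in your case.

  2. The ToDictionary method essentially iterates over the results of the actual LINQ query and adds each result to a new dictionary. Depending on the number of unique words, this can also take some time.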
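The first point can be checked with a small sketch. The class and method names below (GroupStorageSketch, StoredCopies) are hypothetical and only serve the illustration; the sample input is made up as well. It shows that each group holds every occurrence of its word, not just a running count:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupStorageSketch
{
    // Hypothetical helper: for each distinct word, report how many copies
    // of that word the group itself is holding on to.
    public static Dictionary<string, int> StoredCopies(IEnumerable<string> words)
    {
        return words
            .GroupBy(w => w)
            // ToArray() materializes the elements kept inside the group;
            // Count() would only have reported how many there are.
            .ToDictionary(g => g.Key, g => g.ToArray().Length);
    }

    public static void Main()
    {
        var copies = StoredCopies(new[] { "the", "cat", "the" });
        Console.WriteLine(copies["the"]); // 2
        Console.WriteLine(copies["cat"]); // 1
    }
}
```

With a million words, the grouping step therefore keeps a million references around before ToDictionary reduces each group to a single number.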

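Another reason the LINQ query is slightly slower is that LINQ relies on lambda expressions (the delegates in Dathan's answer), and calling a delegate adds a small amount of overhead compared to inline code.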
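To make the delegate calls visible, here is a minimal sketch that writes the key selector as an explicit Func and counts how often it runs. The names DelegateCallSketch and CountWords are assumptions made for this example, and it measures the number of invocations only, not their cost:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelegateCallSketch
{
    // Hypothetical helper: counts words with GroupBy/ToDictionary and also
    // reports how many times the key selector delegate was invoked.
    public static Dictionary<string, int> CountWords(
        IEnumerable<string> words, out int keySelectorCalls)
    {
        int calls = 0;
        // The same lambda the query uses for "group word by word",
        // with a counter added so every invocation can be observed.
        Func<string, string> keySelector = w => { calls++; return w; };

        // ToDictionary runs the query eagerly, so 'calls' is final afterwards.
        var counts = words
            .GroupBy(keySelector)
            .ToDictionary(g => g.Key, g => g.Count());

        keySelectorCalls = calls;
        return counts;
    }

    public static void Main()
    {
        int calls;
        var counts = CountWords(new[] { "the", "cat", "the" }, out calls);
        Console.WriteLine(calls);         // 3: one delegate call per input word
        Console.WriteLine(counts["the"]); // 2
    }
}
```

The foreach version does the same amount of counting work without going through a delegate for every word.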

Edit: note that for some LINQ scenarios (such as LINQ to SQL, but not in-memory LINQ such as here), rewriting the query produces a more optimized plan:

from word in allWordsLINQ 
group word by word into groups 
select new { Word = groups.Key, Count = groups.Count() }

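Note, however, that this does not give you a Dictionary, but rather a sequence of words and their counts. You can transform this into a Dictionary with: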

(from word in allWordsLINQ 
 group word by word into groups 
 select new { Word = groups.Key, Count = groups.Count() })
.ToDictionary(g => g.Word, g => g.Count);