Mat*_*ren 9 .net linq performance foreach
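I'm trying to learn more about LINQ by implementing Peter Norvig's spelling corrector in C#.
The first part involves taking a large number of words (about 1 million) and putting them into a dictionary, where the key is the word and the value is the number of occurrences.
I would normally do it like this: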
foreach (var word in allWords)
{
    if (wordCount.ContainsKey(word))
        wordCount[word]++;
    else
        wordCount.Add(word, 1);
}
where allWords is an IEnumerable<string>.
In LINQ I am now doing this:
var wordCountLINQ = (from word in allWordsLINQ
                     group word by word
                     into groups
                     select groups).ToDictionary(g => g.Key, g => g.Count());
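I compared the two dictionaries by checking every <key, value> pair, and they are identical, so both approaches produce the same result.
The foreach loop takes 3.82 seconds and the LINQ query takes 4.49 seconds.
I'm timing it with the Stopwatch class and running in RELEASE mode. I don't think the performance is bad; I'm just wondering whether there is a reason for the difference.
Am I writing the LINQ query in an inefficient way, or am I missing something?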
Update: here is the complete benchmark code sample:
// Requires: using System; using System.Collections.Generic; using System.Diagnostics;
// using System.IO; using System.Linq; using System.Text.RegularExpressions;
public static void TestCode()
{
    // File can be downloaded from http://norvig.com/big.txt and consists of about a million words.
    const string fileName = @"path_to_file";
    var allWords = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                   select m.Value;
    var wordCount = new Dictionary<string, int>();
    var timer = new Stopwatch();
    timer.Start();
    foreach (var word in allWords)
    {
        if (wordCount.ContainsKey(word))
            wordCount[word]++;
        else
            wordCount.Add(word, 1);
    }
    timer.Stop();
    Console.WriteLine("foreach loop took {0:0.00} ms ({1:0.00} secs)\n",
        timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);
    // Make LINQ use a different enumerable (with exactly the same values);
    // if you don't, it suddenly becomes way faster, which I assume is a caching thing??
    var allWordsLINQ = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                       select m.Value;
    timer.Reset();
    timer.Start();
    var wordCountLINQ = (from word in allWordsLINQ
                         group word by word
                         into groups
                         select groups).ToDictionary(g => g.Key, g => g.Count());
    timer.Stop();
    Console.WriteLine("LINQ took {0:0.00} ms ({1:0.00} secs)\n",
        timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);
}
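On the "caching thing" noted in the benchmark comments: both word sequences are deferred queries over a MatchCollection, and Regex.Matches fills that collection lazily, so each timed section most likely includes the regex matching itself. Reusing the first enumerable would hit matches that are already cached. The following is only a sketch of one way to take that cost out of the measurement by materializing the words first; the class and method names (WordBenchmark, ExtractWords, TimeMs) are made up for illustration, and it uses ToLowerInvariant where the original uses ToLower.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBenchmark
{
    // Runs the regex to completion up front, so later timings
    // iterate over plain strings only.
    public static List<string> ExtractWords(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), "[a-z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    // Same counting loop as in the question.
    public static Dictionary<string, int> CountWithLoop(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            if (counts.ContainsKey(word))
                counts[word]++;
            else
                counts.Add(word, 1);
        }
        return counts;
    }

    // Same query as in the question, written in method syntax.
    public static Dictionary<string, int> CountWithLinq(IEnumerable<string> words)
    {
        return words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
    }

    // Times a single action and returns the elapsed milliseconds.
    public static long TimeMs(Action action)
    {
        var timer = Stopwatch.StartNew();
        action();
        timer.Stop();
        return timer.ElapsedMilliseconds;
    }
}
```

With this, one list from ExtractWords(File.ReadAllText(fileName)) can be passed to both TimeMs(() => CountWithLoop(words)) and TimeMs(() => CountWithLinq(words)), so both measurements see identical, already-extracted input.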
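One reason the LINQ version is slower is that it creates two dictionaries instead of one:
1. One (internally) for the group by operator; the group by also stores every individual word. You can verify this by calling ToArray() on a group rather than Count(). In your case this is overhead you do not really need, since only the counts matter.
2. One in ToDictionary, which is essentially a foreach over the results of the actual LINQ query that adds each result to a new dictionary. Depending on the number of unique words, this can also take some time.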
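To make the two tables concrete, here is a rough hand-written approximation of what the grouped query ends up doing. It is a sketch only: the real GroupBy uses its own internal lookup structure rather than a Dictionary of lists, and the names GroupBySketch, CountViaGroups and FirstGroupElements are invented for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GroupBySketch
{
    // Approximates GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count()).
    public static Dictionary<string, int> CountViaGroups(IEnumerable<string> words)
    {
        // First table: every occurrence of a word is stored in its group.
        var groups = new Dictionary<string, List<string>>();
        foreach (var word in words)
        {
            List<string> bucket;
            if (!groups.TryGetValue(word, out bucket))
            {
                bucket = new List<string>();
                groups.Add(word, bucket);
            }
            bucket.Add(word);
        }

        // Second table: one entry per group, holding only the count.
        var counts = new Dictionary<string, int>();
        foreach (var pair in groups)
            counts.Add(pair.Key, pair.Value.Count);
        return counts;
    }

    // Shows that a real LINQ group still holds every element, not just a count.
    public static string[] FirstGroupElements(IEnumerable<string> words)
    {
        return words.GroupBy(w => w).First().ToArray();
    }
}
```

The foreach version in the question does the equivalent of the second table only, which is where the saving comes from.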
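Another reason the LINQ query is slightly slower is that LINQ relies on lambda expressions (the delegates in Dathan's answer), and calling a delegate adds a small amount of overhead compared with inline code.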
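The delegate cost can be looked at in isolation with a small comparison like the one below. This is an illustrative sketch, not a rigorous benchmark: the names are invented, and the numbers depend heavily on the machine, the runtime and the JIT.

```csharp
using System;
using System.Diagnostics;

public static class DelegateOverheadSketch
{
    // Inline version: the work is written directly in the loop body.
    public static long SumLengthsInline(string[] words)
    {
        long total = 0;
        foreach (var w in words)
            total += w.Length;
        return total;
    }

    // Delegate version: the same work goes through one Func call per element,
    // similar to how LINQ operators invoke the lambdas passed to them.
    public static long SumLengthsViaDelegate(string[] words, Func<string, int> selector)
    {
        long total = 0;
        foreach (var w in words)
            total += selector(w);
        return total;
    }

    // Times both versions over the same input and prints the elapsed time.
    public static void Compare(string[] words, int repetitions)
    {
        long inlineSum = 0;
        var timer = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
            inlineSum += SumLengthsInline(words);
        timer.Stop();
        Console.WriteLine("inline:   {0} ms (checksum {1})", timer.ElapsedMilliseconds, inlineSum);

        long delegateSum = 0;
        timer.Restart();
        for (int i = 0; i < repetitions; i++)
            delegateSum += SumLengthsViaDelegate(words, w => w.Length);
        timer.Stop();
        Console.WriteLine("delegate: {0} ms (checksum {1})", timer.ElapsedMilliseconds, delegateSum);
    }
}
```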
Edit: note that for some LINQ scenarios (such as LINQ to SQL, but not in-memory LINQ as used here), rewriting the query produces a more optimized plan:
from word in allWordsLINQ
group word by word into groups
select new { Word = groups.Key, Count = groups.Count() }
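Note, however, that this does not give you a dictionary but a sequence of words and their counts. You can convert it into a dictionary with: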
(from word in allWordsLINQ
group word by word into groups
select new { Word = groups.Key, Count = groups.Count() })
.ToDictionary(g => g.Word, g => g.Count);
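If the goal is to keep a LINQ-like call site without paying for the intermediate groups, one option is a small extension method that counts directly into a single dictionary. The helper below is hypothetical (ToCountDictionary is not part of LINQ) and is only a sketch of the idea; it also uses TryGetValue so that each word needs one lookup and one write.

```csharp
using System.Collections.Generic;

public static class CountingExtensions
{
    // Hypothetical helper, not a framework method: counts occurrences
    // into one dictionary without storing the grouped elements.
    public static Dictionary<string, int> ToCountDictionary(this IEnumerable<string> source)
    {
        var counts = new Dictionary<string, int>();
        foreach (var item in source)
        {
            int current;
            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(item, out current);
            counts[item] = current + 1;
        }
        return counts;
    }
}
```

It would be called as allWords.ToCountDictionary(), and should give the same result as the foreach loop in the question.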