Cas*_*ton 2 .net c# linq performance
I need to filter a List<object> so that I remove any items where a string property does not exist inside another List<string>.
I created this console app just to make sure I had the LINQ syntax correct:
class FooBar
{
public int Id { get; set; }
public string ValueName { get; set; }
}
Run Code Online (Sandbox Code Playgroud)
and then...
List<FooBar> foobars = new List<FooBar>
{
new FooBar { Id = 1, ValueName = "Val1" },
new FooBar { Id = 2, ValueName = "Val2" },
new FooBar { Id = 3, ValueName = "Val3" },
new FooBar { Id = 4, ValueName = "Val4" }
};
List<string> myStrings = new List<string>
{
"Val1",
"Val3"
};
// Only keep records where ValueName is found in `myStrings`
foobars = foobars.Where(f => myStrings.Contains(f.ValueName)).ToList();
Run Code Online (Sandbox Code Playgroud)
So, this line:
foobars = foobars.Where(f => myStrings.Contains(f.ValueName)).ToList();
Run Code Online (Sandbox Code Playgroud)
does exactly what I want, and it gives me back these two records:
{ Id = 1, ValueName = "Val1" }
{ Id = 3, ValueName = "Val3" }
Run Code Online (Sandbox Code Playgroud)
All's well. BUT... in the actual application, foobars has over 200k items, and myStrings has about 190k. And when that LINQ line gets executed, it takes upwards of 5 minutes to complete.
I'm clearly doing something wrong. 200k records isn't THAT big. And the real FooBar isn't all that complex (no nested objects, and only 9 properties).
What's going on here?
The problem here is that you're doing foobars.Where(f => myStrings.Contains(f.ValueName)) , so for every item in foobars your are checking all items of myStrings.
That scales quadratic. Also called O(n^2), read more here. So if you have 10+10 items, you do 100 checks(10*10), and if you have 10,000+10,000 items, you will do 100,000,000 checks. In your case you're doing 38,000,000,000+ checks ;)
Solution: create a HashSet from myStrings and use Contains of the HashSet.
e.g. replace with:
var myStringsSet = new HashSet<string>(myStrings);
foobars = foobars.Where(f => myStringsSet.Contains(f.ValueName)).ToList();
Run Code Online (Sandbox Code Playgroud)
Now with 10,000+10,000 items, you will do 10,000 checks instead of 100,000,000. In your case that will 200,000 checks instead of 38,000,000,000.