Ann*_*son 0 python performance group-by python-itertools
I have a list of lists like this:
data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]
and I want to group them together if they have the same first two values. Output would be:
data = [(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]
Sublists with the same first two values are always adjacent to each other in the list, but the number of sublists in each group varies.
I tried this:
from itertools import groupby
data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100], ['a', 'd', 1000, 100]]
output = [list(group) for key, group in groupby(data, lambda x:x[0])]
new_data = []
for l in output:
    new_output = [tuple(group) for key, group in groupby(l, lambda x:x[1])]
    for grouped_sub in new_output:
        new_data.append(grouped_sub)
print(new_data)
and got the output:
[(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]), (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]), (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]
That is exactly what I was looking for. However, my real list has len(data) = 1000000, so calling groupby twice, with three iterations in total, is not efficient at all. Is there a way to change the lambda in the first groupby call so that it considers both x[0] and x[1] when grouping?
Modify the key lambda to return a tuple containing both elements:
groupby(data, lambda x: tuple(x[0:2]))
That way the grouping can be done in a single pass with one list comprehension:
>>> [tuple(group) for key, group in groupby(data, lambda x: tuple(x[0:2]))]
[(['a', 'b', 2000, 100], ['a', 'b', 4000, 500]),
 (['c', 'd', 500, 8000], ['c', 'd', 60, 8000], ['c', 'd', 70, 1000]),
 (['a', 'd', 2000, 100], ['a', 'd', 1000, 100])]
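As a possible variant of the tuple key above, operator.itemgetter(0, 1) from the standard library builds the same (x[0], x[1]) key without a lambda or a slice. This is only a sketch: the names first_two and grouped are made up for illustration, and it relies on the question's statement that rows with the same first two values are always adjacent, since groupby only merges consecutive items with equal keys.

```python
from itertools import groupby
from operator import itemgetter

data = [['a', 'b', 2000, 100], ['a', 'b', 4000, 500], ['c', 'd', 500, 8000],
        ['c', 'd', 60, 8000], ['c', 'd', 70, 1000], ['a', 'd', 2000, 100],
        ['a', 'd', 1000, 100]]

# itemgetter(0, 1) returns the tuple (row[0], row[1]) for each row.
# "first_two" is a hypothetical name chosen for this example.
first_two = itemgetter(0, 1)

# Single pass over data: consecutive rows sharing the same key form one group.
grouped = [tuple(group) for _, group in groupby(data, key=first_two)]

print(grouped)
```

If the rows were not already adjacent, the data would have to be sorted by the same key first (for example sorted(data, key=first_two)), which changes the order and adds the cost of the sort.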