Saturday, August 9, 2014

Python GroupBy, Map & Reduce

I came across a really interesting data mangling technique while watching this presentation on advanced Python programming techniques

Here is the example from this talk. Suppose you have a list of dictionary objects, sorted by the 'id' key, my_list as defined below.
>>> my_list = [
    {'id':1, 'name':'raymond'},
    {'id':1, 'email':''},
    {'id':2, 'name':'sue'},
    {'id':2, 'email':''}] #sorted

Using dictgroupby, map and reduce, there is a very elegant way of grouping all of those dictionaries by the 'id' to get the following list:

  {'id': 1, 'email': '', 'name': 'raymond'}, 
  {'id': 2, 'email': '', 'name': 'sue'}

And here is how you do it:
>>>from itertools import groupby
>>> [dict (
    reduce(lambda y,z: y + z,
        map(lambda x: x.items(), v)
for k, v in groupby(my_list, key=lambda x: x['id'])]

Notice how much this is 'SQL' like. Let us break this statement down to its individual components to understand what is happening under the covers. Like any SQL statement, we have to start deciphering it inside out.

Let us look at the for loop with the groupby operation in it. The groupby operation will return an iterator grouping by the 'key' parameter. In this case, it is an anonymous function that returns the 'id' value. Essentially, we are asking for the grouping to happen using the 'id' values (1, 2 etc.).

>>>print({k:list(v) for k,v in groupby(my_list, key=lambda x: x['id'])})

 1: [{'id': 1, 'name': 'raymond'}, 
     {'id': 1, 'email': ''}], 
 2: [{'id': 2, 'name': 'sue'}, 
     {'id': 2, 'email': ''}]

Note that the 'id' values are the keys and they are also repeated as part of the values. This will come in handy for the next step.

Next, we map the anonymous function, which calls the items() method for the parameter passed in, over each of the groups returned from the groupby operation.

>>> for k, v in groupby(my_list, key=lambda x: x['id']):
...     print(map(lambda x: x.items(), v))
[[('id', 1), ('name', 'raymond')], [('id', 1), ('email', '')]]
[[('id', 2), ('name', 'sue')], [('id', 2), ('email', '')]]

This gives us lists of tuples instead of a dictionaries, which makes the reduction step very easy.

Now, we reduce the list that comes out of the mapping step by doing a simple addition of lists. Addition over two lists results in a list with elements from both. We would not have been able to use the '+' operator if these were dictionaries instead. Notice also the duplicate 'id' tuple.
>>>print(reduce(lambda y,z: y + z, [[('id', 1), ('name', 'raymond')], [('id', 1), ('email', '')]]))
[('id', 1), ('name', 'raymond'), ('id', 1), ('email', '')]

The reduce step will also happen for the list for 'id' 2 in this example.

We are almost at the end now. The last step is to make a dictionary from the list coming out of the reduce step to remove duplicates and conform back to the input which was a list of dictionaries.

>>>print(dict([('id', 1), ('name', 'raymond'), ('id', 1), ('email', '')]))
{'id': 1, 'email': '', 'name': 'raymond'}

Note that the duplicate 'id' tuple got removed. As before, this step will also happen for the list for 'id' 2.

That is it! Now, we have what we need. As list of dictionary objects, grouped by 'id'.

  {'id': 1, 'email': '', 'name': 'raymond'}, 
  {'id': 2, 'email': '', 'name': 'sue'}

What are some of the practical uses you see for this technique? Do you have any other slick trick to share?Let me know in the comments below.