Ben Hancock

Computational Journalism, Python, and Linux

Searching Python Dictionaries and Why You Should Care About 'Big O' Notation


When you start writing programs in Python, you'll eventually need to handle different kinds of data. In this post, I'm going to talk about some ways to do that while also introducing a concept that can help you avoid significant performance pitfalls. That concept is called "Big O" notation, and I'll come to it in just a bit.

Python has many different data structures -- lists, tuples, sets, etc. -- and one really nice feature of the language is the way these structures can be combined. You can have a list of lists, a dictionary of dictionaries, and so on. For capturing different records, one simple way of storing your data is in a list of dictionaries. This is a versatile data structure that can be saved as JSON, among other outputs.

Here's what a list of dictionaries looks like:

[
    {'name': 'Dade Murphy',
     'alias': 'Zero Cool',
     'skill': 'Wearing sweet vests'},

    {'name': 'Kate Libby',
     'alias': 'Acid Burn',
     'skill': 'Super 1337 hax0r'},

    # And so on ...
]
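Saving a structure like this as JSON is, for what it's worth, nearly a one-liner with the standard library. Here's a quick sketch; the records and records.json names are just placeholders:

import json

records = [
    {'name': 'Dade Murphy',
     'alias': 'Zero Cool',
     'skill': 'Wearing sweet vests'},
    # And so on ...
]

# Serialize the whole list of dictionaries to a JSON file.
with open('records.json', 'w') as f:
    json.dump(records, f, indent=2)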

OK, so now you've got a bunch of data in a structure like this; maybe you pulled it down by scraping a website using BeautifulSoup or lxml. What do you do with it? You probably want to put it into a database, but setting that aside for now, another option is to write a simple script that will search these dictionaries for a given term and then return the ones that match.

For now, we're going to assume that we know exactly what value we're looking for. That may sometimes be the case, but more likely you'll want to do some sort of pattern-based search. I'll save that for another blog post, though. (For those interested, I'd recommend checking out the chapter on regular expressions in the excellent No Starch Press book, The Linux Command Line.)
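Just to give a quick taste, though, a pattern-based version of the search we're about to write could lean on Python's built-in re module. Here's a sketch, with a purely illustrative pattern, run against the hackers list defined just below:

import re

# A purely illustrative pattern: match any alias containing 'cool'.
pattern = re.compile(r'cool', re.IGNORECASE)

for hacker in hackers:
    if pattern.search(hacker.get('alias', '')):
        print(hacker)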

So our code will look something like this:

wanted_alias = 'Zero Cool'

hackers = [
    {'name': 'Dade Murphy',
     'alias': 'Zero Cool',
     'skill': 'Wearing sweet vests'},

    {'name': 'Kate Libby',
     'alias': 'Acid Burn',
     'skill': 'Super 1337 hax0r'}
    ]

for hacker in hackers:
    # Pull out the value stored at 'alias' (None if the key is missing).
    alias = hacker.get('alias')

    if alias == wanted_alias:
        print(hacker)

This is pretty straightforward. First, we assign the term we're looking for to a variable, wanted_alias. We could also get it from the user, using input() or a command-line argument. Then we loop through every dictionary in our list, asking each one to return the value stored at alias by calling get(). (If you're scratching your head at these names and aliases, by the way, you're missing out on one of the great masterpieces of mid-1990s nerd cinema.) Finally, we check whether that value matches the alias we're looking for.
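If you'd rather not hard-code the search term, here's one way that might look -- a sketch that takes a command-line argument when one is given and falls back to input() otherwise:

import sys

# Use the first command-line argument if one was given;
# otherwise, prompt the user interactively.
if len(sys.argv) > 1:
    wanted_alias = sys.argv[1]
else:
    wanted_alias = input('Alias to search for: ')

The loop itself stays exactly the same.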

When we run the program, the output we get is:

{'name': 'Dade Murphy', 'alias': 'Zero Cool', 'skill': 'Wearing sweet vests'}

Neat. But what if we wanted to look for a string in any value in a given dictionary? We could modify our for loop like so:

for hacker in hackers:
    # This time, check every value in the dictionary, not just 'alias'.
    for value in hacker.values():

        if value == wanted_alias:
            print(hacker)

The output of this program is the same as above, but our search looked through every value in the dictionary, including the values stored at the name and skill keys. In other words, we were more thorough.
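As an aside, Python can express that inner loop more compactly: a dictionary's .values() view supports the in operator, which performs the same value-by-value scan under the hood. A sketch:

for hacker in hackers:
    # 'in' checks every value in the dictionary, one at a time.
    if wanted_alias in hacker.values():
        print(hacker)

Less code, but the same amount of work. So, being more thorough: that's good, right?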

This is where Big O notation comes in. Even if you're not a full-time software engineer, mathematician, or CS master's student (and I fall into none of those categories), it's important to know the implications of what your code is doing. Especially if you're working with a massive data set -- like, say, a big dump of personnel data you've gotten from a FOIA request -- this can mean the difference between a script that runs in a few minutes and one that will churn for hours or a day.

A good overview of Big O notation can be found in this article by Rob Bell. At a high level, it's a way of describing how an algorithm's running time grows as the size of its input grows. Understanding how that applies in the above example takes a bit of wind-up, but bear with me. Under the hood, Python's dictionaries are implemented as what are called hash tables. In general, this means that looking up a value by its key is very fast, and on average it takes the same amount of time no matter how large the dictionary is. Using Big O notation, this is expressed as O(1), or "constant time."
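To make that concrete, here's a sketch of how you could exploit those O(1) lookups yourself: if you'll be searching by alias over and over, build a dictionary keyed on alias once, and every lookup after that is a single hash-table probe rather than a scan. (The hackers_by_alias name is just one I've made up.)

# Build an index once: alias -> full record.
hackers_by_alias = {hacker['alias']: hacker for hacker in hackers}

# Each lookup is now O(1), however many hackers there are.
print(hackers_by_alias.get('Zero Cool'))

Building the index costs one pass over the data, but you pay that price once rather than on every search.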

When we're measuring the performance of our for loop, however, we're talking about something different. In the flat (i.e. not nested) for loop up top, we check only one value per dictionary, so the loop's running time grows in direct proportion to how many dictionaries it has to churn through: it takes "on the order of" n steps, where n is the number of dictionaries. The Big O notation for that is O(n). (Note that we don't write O(2) just because we happen to have two dictionaries at the moment; Big O deliberately discards constants and describes only how the work scales as the data grows. With two dictionaries, our loop simply performs two checks.)

For our nested loop, the work is the number of dictionaries multiplied by the number of key-value pairs inside each one. If there are n dictionaries with m pairs apiece, that's O(n x m) -- or, if you prefer to think of the total number of values as one big n, it's still O(n), just with a much larger n. Concretely, the flat loop above performs 2 checks, while the nested loop performs 6: 2 dictionaries x 3 key-value pairs. Just to emphasize, this is because we have to search through every one of the values, not just the ones stored at alias. When the data we're working with is small, that's no big deal; we don't need to fall victim to over-optimizing. But imagine that we had a data set with a million dictionaries, each holding 500 key-value pairs: that's a million checks for the flat loop versus 500 million for the nested one. You can see how that would make a difference.

To get a sense of what this means, we can use IPython and its convenient %timeit magic function to measure the performance down to the nanosecond. I moved the loops above into separate, appropriately named functions to make this experiment more readable, and added some more data as well.
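If you want to follow along, hacker_functions.py might look something like this -- a sketch reconstructed from the loops above, with the functions returning their matches rather than printing them so that %timeit's repeated runs don't flood the terminal:

# hacker_functions.py

def find_alias(wanted_alias, hackers):
    """Flat loop: check only the value stored at 'alias'."""
    matches = []
    for hacker in hackers:
        if hacker.get('alias') == wanted_alias:
            matches.append(hacker)
    return matches


def nested_find_alias(wanted_alias, hackers):
    """Nested loop: check every value in every dictionary."""
    matches = []
    for hacker in hackers:
        for value in hacker.values():
            if value == wanted_alias:
                matches.append(hacker)
    return matches

With that module in place, here's the session: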

In [1]: import hacker_functions 

In [2]: hackers = [  
   ...:     {'name': 'Dade Murphy', 
   ...:      'alias': 'Zero Cool', 
   ...:      'skill': 'Wearing sweet vests'}, 
   ...:  
   ...:     {'name': 'Kate Libby', 
   ...:      'alias': 'Acid Burn', 
   ...:      'skill': 'Super 1337 hax0r'}, 
   ...:       
   ...:     {'name': 'Paul Cook', 
   ...:      'alias': 'Lord Nikon', 
   ...:      'skill': 'Photographic memory'}, 
   ...:  
   ...:     {'name': 'Emmanuel Goldstein', 
   ...:      'alias': 'Cereal Killer', 
   ...:      'skill': 'Owning awesome hax0r books'},  
   ...:        
   ...:     {'name': 'Ramon Sanchez', 
   ...:      'alias': 'The Phantom Phreak', 
   ...:      'skill': 'pwning the phone company'} 
   ...:     ]

In [3]: %timeit hacker_functions.find_alias("Zero Cool", hackers)
830 ns ± 4.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit hacker_functions.nested_find_alias("Zero Cool", hackers) 
1.39 µs ± 17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

You can see the nested loop took almost twice as long. It's not exactly proportional to the raw count of checks -- function-call and loop overhead blur the ratio at this tiny scale -- but you get the idea. The takeaway here is that while being thorough is good, being precise is better. Happy searching!