Ben Hancock

Computational Journalism, Python, and Linux

Whipping Data Tables Into Shape With Pandas


As with shaping a surfboard, use the right tools when reshaping your data. Photo: Rastaopopoulos64, CC BY-SA 3.0, Wikimedia Commons

Often, when we get data from an outside source, it comes in a format that isn't ideal for analysis or generating visualizations. This is especially true in the case of messy, unstructured data. But even data that is stored in some sort of tabular format may be structured in a manner that's less than ideal. In order to be able to mine the story out of the data, we'll need to reshape it.

There are a lot of ways to reshape your data in Python. But one of the most powerful tools out there is pandas, short for "Python Data Analysis Library." If you're reading this blog, you likely already know what pandas is. But for the uninitiated, it's a third-party Python library created by programmer Wes McKinney that has become core to the Python data science community. It has an intuitive API once you get the hang of it, and good documentation to help you navigate its rich feature set.

In the examples below, I'm going to be using the IPython interactive shell. It's similar to working in Jupyter Notebook, but with less glitz (for folks who like working in the terminal). I'll import pandas as well as numpy in order to simulate some data. (Never mind the non-sequential "In" and "Out" numbers in some of the blocks below ... I did some experimentation in between.)

First, let's create a mock table loosely modeled on the SFO Air Traffic Passenger Statistics that can be downloaded from the DataSF website.

In [6]: import pandas as pd, numpy as np

In [7]: df = pd.DataFrame({
   ...:     'activity_period': [
   ...:         201809, 201809, 201809,
   ...:         201809, 201809, 201809,],
   ...:     'operating_airline': [
   ...:         'Compass Airlines', 'Frontier Airlines', 'Finnair',
   ...:         'Compass Airlines', 'Frontier Airlines', 'Finnair'],
   ...:     'activity': [
   ...:         'Enplaned', 'Enplaned', 'Enplaned',
   ...:         'Deplaned', 'Deplaned', 'Deplaned'],
   ...:     'passenger_count': np.random.randint(30000, size=6)
   ...:     })

In [8]: df
Out[8]:
   activity_period  operating_airline  activity  passenger_count
0           201809   Compass Airlines  Enplaned            29037
1           201809  Frontier Airlines  Enplaned             8388
2           201809            Finnair  Enplaned            21528
3           201809   Compass Airlines  Deplaned             3378
4           201809  Frontier Airlines  Deplaned             9049
5           201809            Finnair  Deplaned            11667
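(The mock table above stands in for the real data. If you'd rather follow along with the actual CSV from DataSF, you can read it straight into a DataFrame; the filename below is just a placeholder for whatever the downloaded file is called on your machine.)

# Hypothetical filename -- substitute the name of the CSV you downloaded.
real_df = pd.read_csv('air_traffic_passenger_statistics.csv')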

Data presented in this way is stored in what's referred to as the "long" format, with each row containing a separate observation. But what if we want to look at the passenger counts for each carrier on a single line, rather than broken up like this? To do that, we'll use the pandas DataFrame pivot method:

In [15]: df.pivot(index='operating_airline',
    ...:          columns='activity',
    ...:          values='passenger_count')

Out[15]:
activity           Deplaned  Enplaned
operating_airline
Compass Airlines       3378     29037
Finnair               11667     21528
Frontier Airlines      9049      8388
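Note that the airline names have become the index of the new frame, so individual figures are easy to look up with .loc. A quick sketch, assuming the pivoted result is saved to a variable (the name wide is my own):

wide = df.pivot(index='operating_airline',
                columns='activity',
                values='passenger_count')

# Pull a single cell: Finnair's enplaned passengers.
wide.loc['Finnair', 'Enplaned']

# Or grab the whole 'Enplaned' column as a Series indexed by airline.
wide['Enplaned']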

You can also show the corresponding activity period, like so:

In [16]: df.pivot(index='operating_airline',
    ...:          columns='activity',
    ...:          values=['activity_period', 'passenger_count'])

Out[16]:
              activity_period          passenger_count
activity                 Deplaned Enplaned        Deplaned Enplaned
operating_airline
Compass Airlines           201809   201809            3378    29037
Finnair                    201809   201809           11667    21528
Frontier Airlines          201809   201809            9049     8388
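Passing a list to values gives the result hierarchical (MultiIndex) columns, so selecting from it takes one extra step. A small sketch, again with a made-up variable name:

wide2 = df.pivot(index='operating_airline',
                 columns='activity',
                 values=['activity_period', 'passenger_count'])

# Selecting the top level returns just the passenger counts.
wide2['passenger_count']

# A tuple drills down to a single column.
wide2[('passenger_count', 'Enplaned')]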

This works for this very small data set because we don't have multiple "Enplaned" or "Deplaned" observations for any of our operating carriers. If we did, we'd have to tell pandas how to aggregate those values. We can see how this becomes an issue if we try to call pivot to see how many passengers each airline saw in each activity period:

In [27]: df.pivot(index='activity_period',
    ...:          columns='operating_airline',
    ...:          values='passenger_count')

ValueError: Index contains duplicate entries, cannot reshape

In other words, there are multiple rows of data for each operating airline (as we know), and pandas doesn't know how to combine them. To do this kind of aggregation, we'll have to use the pivot_table method, which takes an aggfunc keyword argument telling it how to combine the values:

In [28]: df.pivot_table(index='activity_period',
    ...:                columns='operating_airline',
    ...:                values='passenger_count',
    ...:                aggfunc='sum')

Out[28]:
operating_airline  Compass Airlines  Finnair  Frontier Airlines
activity_period
201809                        32415    33195              17437
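(Worth knowing: if you leave aggfunc out, pivot_table averages the values by default, so being explicit about 'sum' matters here. It can also tack row and column totals onto the result if you pass margins=True; they show up under the label "All" unless you override that with margins_name.)

df.pivot_table(index='activity_period',
               columns='operating_airline',
               values='passenger_count',
               aggfunc='sum',
               margins=True)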

For those who have spent time working in Excel, Google Sheets, or LibreOffice Calc, the notion of a pivot table will not be all that new. So why do this kind of data munging with Python and pandas? As noted in this Quora response, the two don't necessarily have to be seen as mutually exclusive, but rather as different tools for different kinds of tasks. For browsing your data, spreadsheet programs like Excel and Calc make a lot of sense.

That said, there are some pretty strong reasons to use Python and pandas. The operations are scriptable, for one, so performing the same kinds of data munging tasks on multiple data sets is very fast. (This can be especially useful if you want to generate a series of reports with charts.) Using pandas and interactive computing tools like IPython and Jupyter Notebook also allows you to "show your work"; you can look over the various steps you've taken in cutting, reshaping, or aggregating the data, and more easily find mistakes in your analysis, while keeping your original data intact.
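To make that concrete, here's a rough sketch of what a scripted, repeatable version of the pivot above might look like when run over a whole folder of monthly extracts. The folder and file names are placeholders, and the columns are assumed to match the mock table:

from pathlib import Path

import pandas as pd

# Hypothetical folder of monthly CSV extracts with the same columns as above.
for csv_path in sorted(Path('data').glob('*.csv')):
    monthly = pd.read_csv(csv_path)
    summary = monthly.pivot_table(index='activity_period',
                                  columns='operating_airline',
                                  values='passenger_count',
                                  aggfunc='sum')
    # Write each reshaped table alongside the original, untouched file.
    summary.to_csv(csv_path.with_name(csv_path.stem + '_summary.csv'))

Each run produces the same output from the same inputs, which is exactly the kind of reproducibility that's hard to guarantee when the reshaping happens by hand in a spreadsheet.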

Hopefully this has been a helpful introduction. I'll cover more ways to reshape and clean data with pandas in future posts.
