Ben Hancock

Computational Journalism, Python, and Linux

Collecting Data from Messy Websites


If you're a journalist (or a student, or a researcher, or pretty much anyone else who relies on the internet for information), you've undoubtedly been in this situation: you want to save a bunch of data from a website, but it's in messy or uncopyable format. With so many sites today now using JavaScript, even scrapers can be confounded by sites that appear full of data to the human eye, but that come back empty when scraped. If you want to remedy this problem -- and maybe even get a bit more information that you can immediately glimpse -- you need to get a deeper under the website's hood.

In this post, I'll be taking the lessons I learned from this post on the Online Journalism Blog and applying them to a slightly more messy situation I encountered earlier this year. I'll then get into extracting and parsing the data with, what else, Python!

My task seemed simple enough: get a list of the top non-U.S. companies listed on the Global 500 index that Fortune releases each year. Easy, right? I mean, they're all listed right here. The thing is, the Fortune website is a bloated JavaScript monster, with tons of auto-playing videos and other nasty stuff. So if you just try to copy, say, the first 10 companies and their ranks, you get something that looks like this:

NON-U.S. COMPANIES
RANKCOMPANYCOUNTRY
2State GridChina
3Sinopec GroupChina
4China National PetroleumChina
5Royal Dutch ShellNetherlands
6Toyota MotorJapan
7VolkswagenGermany
8BPBritain
12Samsung ElectronicsSouth Korea
14GlencoreSwitzerland
16DaimlerGermany  

Which is not terrible, and could probably be cleaned up with a macro in your text editor of choice. But we can do better than that. Plus, there's all this other data about each company you can get after you click its name. Wouldn't it be nice if we could grab that too?

To reiterate a bit of what's covered in the OJB post, the first thing you'll want to do is get acquainted with the "Inspect" feature in your web browser. I favor the tools available in Firefox (for reasons I'll get to in a second), but Chrome has basically the same functionality. Right-click anywhere on the site and scroll down to select the "Inspect Element" option. This opens up a set of windows with different tabs; the one we're interested is the "Network" tab -- and specifically, the sub-tab labeled "XHR". That's short for "XMLHttpRequest."

Fortune 500 Site Inspector Screen Grab

To understand why this is the tab we want, it's helpful to know that the data in this site isn't actually embedded in the HTML of the page. It's stored somewhere else and loaded in dynamically (i.e. it's being requested from another URL). Our first clue that that's the case is when we scroll past the first few dozen names in the list and then, pop!, a bunch more load.

Unlike the site in the OJB post, this page has a lot of stuff going on, and it's not immediately clear where the data we want is coming from. To figure out how this is happening, keep your "Inspect Element" pane open, reload the website, and then scroll down again. Watch that XHR tab closely, and you'll see a request go by called (simply) "yes" that returns json data. This is what we're looking for.

Taking a closer look at the json data is where Firefox really shines. If you were looking at the data in it's raw form, it would be a garbled mess of text, colons, and curly braces. With Firefox, you can just follow the requested URL and then start picking through the key-value pairs. The key we're interested in is called "list-items"; open the tree, and we find numbers 0 through 29, and nested underneath each is a company, its ranking, and a whole lot more data.

There's something curious though. What if we want the first 100 companies, not just the first 30? I'll jump ahead a bit here, but the clue is in the URL: http://fortune.com/api/v2/list/2386574/expand/item/ranking/asc/20/30/search//non-us-cos-y-n/yes. With a bit of experimentation, we can figure out that this is saying, "Give me the next 30 companies, starting the index at 20." So if we want the first 100, we would change the end of the URL to read asc/0/100/search//non-us-cos-y-n/yes. Et voilà, "list-items" now has keys for items 0 through 99, starting with the first non-U.S. company, "State Grid."

Screenshot of returned JSON data

Now that we've found where our data is, next is the fun part: extracting it and transforming it into a usable format with Python. Start by using the fantastic requests library to grab the data.

>>> import requests
>>> url = 'http://fortune.com/api/v2/list/2386574/expand/item/ranking/asc/0/100/search//non-us-cos-y-n/yes'
>>> r = requests.get(url)
>>> data = r.json()
>>> type(data)
<class 'dict'>

Now, we can work with the data as a Python dictionary, and output it as we'd like. For example, you could loop through all the companies and print their rank, and whether they were downgraded from the previous year.

>>> for item in data['list-items']:
...     co = item.get('title')
...     rank = item.get('rank')            
...     drop = item['filter'].get('rankdrop')
...     print(f'{co}, {rank}, {drop}')
...
State Grid, 2, no
Sinopec Group, 3, no
China National Petroleum, 4, no
Royal Dutch Shell, 5, no
Toyota Motor, 6, yes
Volkswagen, 7, yes
BP, 8, no
Samsung Electronics, 12, no
--snip--

The options go on from here. You can take the data and throw it into a spreadsheet using Python's csv module, or do some more analysis by creating a pandas dataframe. Maybe you'd like to examine the geographic distribution of the top non-U.S. companies, sort the companies by revenue change, or just get a list of all the CEOs.

Either way, the data has become a lot more rich and malleable than it was just sitting on that website, and allows us to dig in and draw more insights. And that's what it's all about.