Ben Hancock

Computational Journalism, Python, and Linux

What is Computational Journalism?

After I started my new job as Data Editor in the newsroom, I found that my family and others outside the office had a hard time understanding what, exactly, it was I would be doing. Explaining what a reporter does is straightforward enough; as I once jokingly summed it up to my young daughter, "A lot of writing, a lot of phone calls." But what does a "data journalist" do? Moreover, what is this thing called data journalism?

The name of this blog, of course, is not "Data Journalism & Python" — and there's a reason for that, which I will come to later. But for now let's stick with this question. I'll argue that computational journalism (my preferred term for this role and profession) is a superset of data journalism, and so it's helpful to start from a baseline that focuses on data.

Let's start with a recent example. Many readers have probably seen "Crazy Rich Asians" (or read the book). Not long after the movie was released, The New York Times took a deep look at new research from the Pew Research Center finding that income inequality among Asian-Americans has become the most severe among all ethnic groups in the U.S.

NYT Asian-American Inequalities Visualization
Source: The New York Times

Now, one could argue that the Times mainly just republished what these researchers found and produced, and that wouldn't be completely inaccurate. But the reporters also presumably had to understand the data well enough to break down the essential points, create (or re-create) visualizations that helped tell the story, and find other data that might further illuminate what's really going on. That is, I think, one form of what we could call "data journalism."

But there's another type that's a more apt description of this discipline: one in which the journalists themselves mine and parse the data in order to find what the story is. There are great recent examples of this too, from the clever and visually driven exposé on women's pants pockets by the engineer-journalists at The Pudding, to ProPublica's jarring analysis of surgeon performance at hospitals across the U.S. Vox's work mapping neighborhood lead exposure risk by U.S. Census tract is another phenomenal example of impactful data journalism.

These are stories that depend upon the aggregation of a lot of information (sometimes gathered manually) in order to point to a larger trend. The approach is similar to what reporters do every day when they talk to multiple sources to put together the pieces of a story. But it's on a different scale. Often, stories of this magnitude couldn't be properly told without pulling together massive amounts of objective data, and using computation to process it and present the output in a compelling way.

Indeed, the importance of visualizations in data journalism shouldn't be understated. Numbers, by themselves, can be hard for readers to really grasp. A good visualization can illuminate a trend or pattern of facts that otherwise would be obscured; it can also drive home an idea that people may generally believe already. For instance, I would venture that most Californians who pay attention to the news know there are a lot of wildfires in the state, and that there seem to be more of them than before. But these visualizations created by Peter Aldhous at BuzzFeed paint that picture even more clearly.

BuzzFeed fires visualization
Source: BuzzFeed

Of course, it isn't all about visualization, as this feature on public employee salaries by the Baltimore Sun shows. Sometimes, it's just about giving the audience access to information they wouldn't otherwise have. (The Baltimore Sun project also made use of a very cool Python library called Datasette, but that's a topic for another post.)

OK, got it. So if those are all examples of data journalism, what exactly is computational journalism, and why is that my preferred term? To frame this up, I'll start by borrowing the definition put forward by Columbia University's Jonathan Stray, who started the Computational Journalism Workbench and wrote this back in 2011:

I’d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of “reporting” — because information not published is information not known — but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.

Stated another way, computational journalism is not just about crunching data that are already there, but about using programming and software to bring new information to light. Take, for example, Gizmodo's creation of a tool that helped Facebook users better understand how the social media company was recommending "People You May Know." In an article about how Facebook tried to put the kibosh on the project, reporters for the outlet wrote about what the episode meant:

Journalists need to probe technological platforms in order to understand how unseen and little understood algorithms influence the experiences of hundreds of millions of people—whether it’s to better understand creepy friend recommendations, to uncover the potential for discrimination in housing ads, to understand how the fake follower economy operates, or to see how social networks respond to imposter accounts.

It's not just about generating a single story or a feature, either. Computational journalism can be about creating newsroom tools that enhance the information-gathering process. Writing web scrapers is one example; in fact, that's what Stray's CJ Workbench is all about. Another example is pneumatic, a Python library for bulk uploading documents to DocumentCloud, created by Anthony DeBarros (author of the great data journalism handbook Practical SQL).
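To make the idea of a scraper a little more concrete, here is a minimal sketch using only Python's standard library. It is not how CJ Workbench or any other tool mentioned here works; the `LinkCollector` class, the `extract_links` function, and the sample HTML are all hypothetical, standing in for a page of public filings a reporter might want to check regularly.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical stand-in for a downloaded page; not a real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/filings/001.pdf">Budget amendment</a></li>
  <li><a href="/filings/002.pdf">Zoning appeal</a></li>
</ul>
"""


def extract_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


for href, text in extract_links(SAMPLE_HTML):
    print(href, text)
```

In practice the HTML would come from a live request rather than a string, and a real scraper would need to handle errors, rate limits, and the site's terms of use.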

Computational journalism, I would argue, is also closely related in spirit to "civic hacking." Done well, both are about using digital tools to advance the public interest. This can relate to criminal justice, as in Code for America's "Clear My Record" project, or more mundane municipal governance issues. One article that recently made the rounds on Hacker News was Matt Chapman's blog post about how he used FOIA records and a Unix shell script to expose how unclear signage on one street in Chicago was leading to a bunch of parking tickets.
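The general shape of that kind of analysis, counting records by location to see where they cluster, can be sketched in a few lines of Python. This is not Chapman's script (he used a Unix shell script), and the column names, addresses, and rows below are made up purely for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; this is not real ticket data.
SAMPLE_CSV = """address,violation
100 N EXAMPLE ST,RESIDENTIAL PERMIT PARKING
100 N EXAMPLE ST,RESIDENTIAL PERMIT PARKING
200 W SAMPLE AVE,EXPIRED METER
100 N EXAMPLE ST,RESIDENTIAL PERMIT PARKING
300 S PLACEHOLDER RD,EXPIRED METER
"""


def count_tickets_by_address(csv_file):
    """Tally tickets per address from a file-like object of CSV rows."""
    reader = csv.DictReader(csv_file)
    # Normalize whitespace and case so near-identical addresses group together.
    return Counter(row["address"].strip().upper() for row in reader)


counts = count_tickets_by_address(io.StringIO(SAMPLE_CSV))

# List the addresses with the most tickets first.
for address, n in counts.most_common(3):
    print(f"{n:>3}  {address}")
```

An address that stands far above the rest is not a story by itself, but it tells a reporter where to go look.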

Stanford's Open Policing Project, which the school's Computational Journalism Lab contributed to, is another example of how algorithms and data can be used to expose issues that affect our society deeply. The lab is also helping bring more transparency to campaign finance through the California Civic Data Coalition.

As to why any of this matters, I would argue (admittedly, with a bit of professional self-interest) that this role is growing ever more important. We are at a unique moment when nearly every aspect of society is shaped by computers, software, and data. As the author Andrew Smith recently wrote in 1843, a magazine published by The Economist:

By accident more than design, coders now comprise a Fifth Estate and as 21st-century citizens we need to be able to interrogate them as deeply as we interrogate politicians, marketers, the players of Wall Street and the media.

The press is no exception to that. In order to maintain an open society, journalists need to be able to skillfully wield the tools of this age to uncover and explain what is happening. Far from just being the newsroom nerds, the role of the computational journalist is central to how media organizations navigate today's digital terrain.