Ben Hancock

Computational Journalism, Python, and Linux

Converting PDFs to Plain Text


In my last post, I talked about the power of grep in searching through plain text files. If you're like many journalists – overwhelmed with documents in a format other than plain text – you may have been thinking, "Gee, that's great. But this is not terribly useful for my work."

In this post, I'm going to pass on some tips for converting PDFs to plain text that I learned at the 2019 NICAR conference from Chad Day, who was then at the Associated Press covering the Mueller investigation and is now with the Wall Street Journal. Plain text is not only useful for searching, but is also much more lightweight (that is, friendlier to your hard drive space) than beefy PDFs. There are some occasions when you will likely want to search the PDFs themselves; in those cases, pdfgrep can also be a great tool, but that's a subject for another post.

You can find Chad's GitHub repo from his presentation here. Note that his presentation is geared toward macOS users who are installing packages with homebrew. I'll cover a couple other things to be mindful of when using the same tools on Linux.

First, a quick distinction: there are two main kinds of PDF documents. There are machine-readable PDFs, which are the kind you can copy from and search through using Ctrl + F; and then there are non-machine-readable PDFs. These are the kind government offices love to send you in response to a FOIA request: ugly, scanned documents that threaten to send you on an interminable document dive. Never fear! You can unlock the secrets of both kinds.

Machine-Readable

For machine-readable PDFs, the simplest tool to use is xpdf, whose text extractor is invoked at the command line as pdftotext. You can install it using your package manager (e.g. apt on Ubuntu-based Linux distros) or using homebrew if you're on a Mac. The syntax is pretty straightforward:

$ pdftotext infile.pdf outfile.txt

Et voilà! Your machine-readable PDF went from this:

img

To this:

img

This is an example from a court case I covered, the Waymo v. Uber litigation. Note that specifying the name of the output file is optional. pdftotext converts one PDF per invocation, so a wildcard like *.pdf won't batch process a directory on its own (when given two arguments, it treats the second as the output file). Instead, write a short shell script to loop through every PDF file in the directory, and be aware of when the extension is actually .PDF. Bonus points: you can clean up long lines in the text file output using the command line utility fmt as well.
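Here is a minimal sketch of such a loop. The function name pdfs_to_text is made up for illustration; the only real tool it calls is pdftotext, once per file, and it handles both the .pdf and .PDF spellings.

```shell
# pdfs_to_text is a hypothetical helper name, not part of xpdf.
# Usage: pdfs_to_text [directory]   (defaults to the current directory)
pdfs_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf "$dir"/*.PDF; do
        # If a pattern matched nothing, the shell leaves it as literal text;
        # skip it rather than handing a nonexistent file to pdftotext.
        [ -e "$pdf" ] || continue
        # Strip the extension (whatever its case) and write <name>.txt
        pdftotext "$pdf" "${pdf%.*}.txt"
    done
}
```

You would call it as pdfs_to_text path/to/pdfs, and could then rewrap any one of the results with something like fmt -w 80 outfile.txt > outfile_wrapped.txt.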

Chad also shows in his repo how pdftotext can be invoked with a -table flag to convert tabular data to plain text while preserving its column formatting. But be aware that the -table option doesn't seem to be available in all versions of the underlying xpdf tool. Newer versions do seem to support it, if you download from the website. (Note that you can also download support for a broader range of languages, including Arabic, Korean, Cyrillic, and both simplified and traditional Chinese.)
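One way to cope with that version difference is to ask the installed pdftotext what it supports before relying on -table. The sketch below greps the tool's help text; the helper names supports_table and table_flag are made up for illustration, and falling back to -layout (a flag that keeps the physical layout of the page) is an assumption about what you would want instead, not something from Chad's repo.

```shell
# supports_table (hypothetical helper) succeeds if the installed pdftotext
# mentions a -table option in its help output.
supports_table() {
    pdftotext -h 2>&1 | grep -q -e '-table'
}

# table_flag (hypothetical helper) prints the flag to use for tabular PDFs.
table_flag() {
    if supports_table; then
        printf '%s\n' '-table'
    else
        # Assumption: -layout is an acceptable fallback for column data
        printf '%s\n' '-layout'
    fi
}
```

A conversion would then look like pdftotext "$(table_flag)" infile.pdf outfile.txt.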

Non-Machine-Readable

Dealing with non-machine-readable PDFs is a little more involved; you'll need to use several tools together in order to achieve this. At a high level, the steps you'll need to take are:

  1. Convert the PDF to an image
  2. Run optical character recognition (OCR) on that image
  3. Convert the OCR'd image back to plain text

The tools we'll be using are pdftoppm for step 1, and tesseract for steps 2 and 3. Chad uses ImageMagick for step 1 in his tutorial, but following a suggestion from Thomas Wilburn at NPR, I've found that pdftoppm can be faster and just as effective – although your mileage may vary depending on the use case.

For this example, I'm going to use a batch of emails that were disclosed by the U.S. Trade Representative's office under FOIA and posted by the Electronic Frontier Foundation. These emails are from 2015 and include emails between U.S. trade officials and corporate lobbyists about a proposed international trade agreement known as TISA – something I used to cover on my old beat in DC.

First, make a directory called output, and then invoke the following command:

$ pdftoppm -png -f 1 tisa_foia_ustrcomms.pdf "output/tisa_out"

Here, we're telling the utility to convert the PDF into a bunch of .png files in our output directory, beginning each file name with tisa_out, and starting at page 1. This is the top of what the directory list looks like now:

tisa_out-001.png
tisa_out-002.png
tisa_out-003.png
tisa_out-004.png
-- etc. ---
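If you find yourself repeating this step, it can be wrapped in a small function. This is only a sketch: pages_to_png is a made-up name, and the 300 DPI default is an assumption (pdftoppm's own default is 150), picked because OCR often does better on higher-resolution images.

```shell
# pages_to_png is a hypothetical helper around pdftoppm.
# Usage: pages_to_png input.pdf output/prefix [first_page] [dpi]
pages_to_png() {
    pdf=$1
    prefix=$2
    first=${3:-1}
    dpi=${4:-300}   # assumption: 300 DPI; pdftoppm defaults to 150
    # Create the output directory named in the prefix, if any
    mkdir -p "$(dirname "$prefix")" || return 1
    # -png: write PNG files; -r: resolution in DPI; -f: first page to convert
    pdftoppm -png -r "$dpi" -f "$first" "$pdf" "$prefix"
}
```

For the example above, that would be pages_to_png tisa_foia_ustrcomms.pdf output/tisa_out.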

Now we want to build a plain text list of all of our files, which we can do using the find command like so:

$ find output/ -type f -name '*.png' > filelist.txt
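Because find makes no promise about the order of its results, it can be convenient to sort the list as you build it. A minimal sketch, with build_filelist as a made-up helper name: since pdftoppm zero-pads the page numbers (as in the listing above), a plain sort puts the pages in order.

```shell
# build_filelist is a hypothetical helper name.
# Usage: build_filelist image_directory list_file
build_filelist() {
    # The quotes around '*.png' keep the shell from expanding the pattern
    # before find gets to see it.
    find "$1" -type f -name '*.png' | sort > "$2"
}
```

For this example you would run build_filelist output filelist.txt.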

We're ready to move on to step 2, and for this we'll use tesseract, an open source OCR engine originally developed by Hewlett-Packard and now maintained by Google [GitHub]. Tesseract has many options, so consult the full manual if you're curious, but for now we'll just stick to the basics.

Before we enter our command, you may want to first take note of whether the filelist.txt file you created listed the pages in order. If not, use the command sort on the file before getting started. Then enter this:

$ tesseract filelist.txt tisa_ocr

You'll see tesseract begin to count the pages it goes through, occasionally with messages like "Detected 10 diacritics." The command above should output this to a file called tisa_ocr.txt. You can choose other file formats as well, such as PDF; again, consult the man page for more info.
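Putting the steps together, the whole workflow can be sketched as one shell function. The name ocr_pdf and its argument order are made up for illustration; the commands inside are the same pdftoppm, find, and tesseract calls used above, with a sort added so the pages are processed in order.

```shell
# ocr_pdf is a hypothetical wrapper around the steps described above.
# Usage: ocr_pdf input.pdf work_directory output_basename
ocr_pdf() {
    pdf=$1
    workdir=$2
    outbase=$3

    mkdir -p "$workdir" || return 1

    # Step 1: one PNG per page, named <workdir>/page-NNN.png
    pdftoppm -png "$pdf" "$workdir/page" || return 1

    # Build a sorted list of the page images for tesseract to read
    find "$workdir" -type f -name '*.png' | sort > "$workdir/filelist.txt"

    # Steps 2 and 3: OCR every listed image and write <outbase>.txt
    tesseract "$workdir/filelist.txt" "$outbase"
}
```

With the TISA emails, ocr_pdf tisa_foia_ustrcomms.pdf output tisa_ocr should leave the text in tisa_ocr.txt.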

You're done! Now you can quickly search through this 120-some page document using grep. You could use a file with a list of names or topics to quickly assess where the interesting stuff is. Or you could use a regex pattern to pull out every email address in the correspondence (those that aren't blacked out, at least).
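Both kinds of search can be sketched with grep alone. The function names and the idea of a terms file with one name or topic per line are illustrative assumptions, and the email pattern is a rough approximation that will miss some valid addresses and may catch OCR noise.

```shell
# Rough email pattern: an approximation, not a complete address matcher
EMAIL_RE='[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# extract_emails (hypothetical helper): print each distinct address in a file.
# -E: extended regex; -o: print only the matching part of each line
extract_emails() {
    grep -E -o "$EMAIL_RE" "$1" | sort -u
}

# search_terms (hypothetical helper): print numbered lines that mention any
# term in a terms file (one per line, no blank lines).
# -F: fixed strings; -f: read patterns from a file; -i: ignore case
search_terms() {
    grep -n -i -F -f "$1" "$2"
}
```

For example, extract_emails tisa_ocr.txt lists the addresses, and search_terms names.txt tisa_ocr.txt shows where the names in a (hypothetical) names.txt turn up.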

A couple of quick tips to make searching through this easier: use the fmt command (as mentioned above) with a desired column width to make the text easier to read. You can also invoke grep's -n flag to print the line number of each match, so you can find it and read it for more context. Or you can use the -C flag with a number, which will print that many lines of surrounding context on either side of each match.
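To see how those flags behave, here is a small self-contained sketch. It uses an invented one-line sample instead of the real tisa_ocr.txt so it can be run anywhere; the 40-column width and the search term are arbitrary choices.

```shell
# Invented sample text standing in for a long line of OCR output
sample=$(mktemp)
wrapped=$(mktemp)
printf '%s\n' 'Thanks for the readout from the Geneva round. We would like to discuss the TISA financial services annex with your team before the next set of talks begins.' > "$sample"

# Rewrap so that no line runs past 40 columns, writing to a new file
fmt -w 40 "$sample" > "$wrapped"

# -n prefixes each line with its line number; -C 1 adds one line of
# context before and after the matching line
grep -n -C 1 'TISA' "$wrapped"
```

In grep's output, the matching line has a colon after its line number, while the context lines have a hyphen there.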

EDIT [2019-08-20]: An earlier version of this post mistakenly suggested that you would need to use pdftotext after using tesseract.