In my last post, I talked about the power of
grep in searching through plain
text files. If you're like many journalists – overwhelmed with documents in a
format other than plain text – you may have been thinking, "Gee, that's
great. But this is not terribly useful for my work."
In this post, I'm going to pass on some tips for converting PDFs to plain text
that I learned at the 2019 NICAR conference from Chad Day, who was then at the
Associated Press covering the Mueller investigation, and now is with the Wall
Street Journal. Plain text is not only useful for searching, but is also much
more lightweight (that is, more friendly for your hard drive space) compared to
beefy PDFs. There are some occasions when you will likely want to search the
PDFs themselves; in those cases,
pdfgrep can also be a great tool, but
that's a subject for another post.
You can find Chad's GitHub repo from his presentation here. Note that his presentation is geared toward macOS users who are installing packages with homebrew. I'll cover a couple other things to be mindful of when using the same tools on Linux.
First, a quick distinction: there are two main kinds of PDF documents. There
are machine-readable PDFs, which are the kind whose text you can copy and search with
Ctrl + F; and then there are non-machine-readable PDFs. These are the
kind government offices love to send you in response to a FOIA request: ugly,
scanned documents that threaten to send you on an interminable document
dive. Never fear! You can unlock the secrets of both kinds.
For machine-readable PDFs, the simplest tool to use is
xpdf, which is
invoked at the command line with
pdftotext. You can install it using your
package manager (e.g.
apt on Ubuntu-based Linux distros) or using homebrew if
you're on a Mac. The syntax is pretty straightforward:
$ pdftotext infile.pdf outfile.txt
Et voilà! Your machine-readable PDF is now a plain text file. The example above is from a court case I covered, the Waymo v. Uber litigation. Note that specifying the name of the output file is optional. You can also batch process a whole folder of PDF files using the *.pdf wildcard, but be aware of what the extension on your files actually is (an uppercase .PDF won't match the lowercase wildcard, for example).
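The versions of pdftotext I've used expect one input file at a time, so a small shell loop over the wildcard is a safe way to run the batch conversion; a minimal sketch:
$ for f in *.pdf; do pdftotext "$f"; done
Each PDF gets a matching .txt file written next to it.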
Chad also shows in his repo how pdftotext can be invoked with a -table flag to convert tabular data to plain text while preserving its column formatting. But be aware that the -table option doesn't seem to be available in all versions of the underlying xpdf tool. Newer versions do seem to support it, if you download them from the xpdf website. (Note that you can also download
support for a broader range of languages, including Arabic, Korean, Cyrillic,
and both simplified and traditional Chinese.)
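For reference, here's roughly what that looks like. The -table flag works with the newer xpdf builds mentioned above, while the pdftotext that ships with poppler (the usual package on Linux distros) has a -layout flag that similarly tries to preserve the physical layout of the page. The file name here is just a placeholder:
$ pdftotext -table report.pdf report.txt
$ pdftotext -layout report.pdf report.txt
Either way, you get a text file whose columns stay roughly lined up, which makes tabular data much easier to scan or grep.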
Dealing with non-machine-readable PDFs is a little more involved; you'll need to use several tools together in order to achieve this. At a high level, the steps you'll need to take are:
- Convert the PDF to an image
- Run optical character recognition (OCR) on that image
- Convert the OCR'd image back to plain text
The tools we'll be using are
pdftoppm for step 1, and
tesseract for steps 2
and 3. Chad uses
ImageMagick for step 1 in his tutorial, but following a
suggestion from Thomas Wilburn, I've found that
pdftoppm can be faster and just as effective –
although your mileage may vary depending on the use case.
For this example, I'm going to use a batch of emails that were disclosed by the U.S. Trade Representative's office under FOIA and posted by the Electronic Frontier Foundation. These emails are from 2015 and include emails between U.S. trade officials and corporate lobbyists about a proposed international trade agreement known as TISA – something I used to cover on my old beat in DC.
First, make a directory called
output, and then invoke the following command:
$ pdftoppm -png -f 1 tisa_foia_ustrcomms.pdf "output/tisa_out"
Here, we're telling the utility to convert the PDF to a bunch of PNG images in our output directory, beginning each file name with tisa_out, and starting at page 1. This is what the top of the directory listing looks like now:
tisa_out-001.png tisa_out-002.png tisa_out-003.png tisa_out-004.png -- etc. ---
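One option worth knowing about: if the page images come out too fuzzy for clean OCR, pdftoppm's -r flag sets the output resolution in DPI (something around 300 is a common choice for scanned documents). A sketch reusing the same file names:
$ pdftoppm -png -r 300 -f 1 tisa_foia_ustrcomms.pdf "output/tisa_out"
Higher resolution means bigger image files and a slower OCR pass, so it's a tradeoff.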
Now we want to build a plain text list of all of our files, which we can do with the find command like so:
$ find output/ -type f -name "*.png" > filelist.txt
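find doesn't guarantee any particular order, so if the list comes out shuffled you can sort it in place with sort's -o flag; a minimal sketch:
$ sort -o filelist.txt filelist.txt
Because pdftoppm zero-pads the page numbers, plain alphabetical order matches page order here.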
We're ready to move on to step 2, and for this we'll use tesseract, an
open source OCR engine originally developed by Hewlett-Packard and now
maintained by Google. Tesseract has many options, so consult the full
manual if you're curious, but for now we'll just stick to the basics.
Before we enter our command, you may want to first take note of whether the
filelist.txt file you created lists the pages in order. If not, use the
sort command on the file before getting started. Then enter this:
$ tesseract filelist.txt tisa_ocr
tesseract will begin counting off the pages as it works through them, occasionally
with messages like "Detected 10 diacritics." The command above should output
the recognized text to a file called
tisa_ocr.txt. You can choose other file formats as
well, such as PDF; again, consult the
man page for more info.
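For instance, if you'd rather end up with a searchable PDF than plain text, tesseract can write one directly by adding the pdf output option at the end of the command:
$ tesseract filelist.txt tisa_ocr pdf
That should produce tisa_ocr.pdf, with the recognized text layered invisibly over the page images.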
You're done! Now you can quickly search through this 120-some page document with
grep. You could use a file with a list of names or topics to quickly
assess where the interesting stuff is. Or you could use a regex pattern to pull
out every email address in the correspondence (those that aren't blacked out, anyway).
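Here's a rough sketch of both ideas; names.txt is a hypothetical file with one name or keyword per line, and the email pattern is deliberately loose rather than a strict validator:
$ grep -i -f names.txt tisa_ocr.txt
$ grep -E -o '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' tisa_ocr.txt
The first prints every line that matches any entry in names.txt (case-insensitively); the second prints just the email-address-shaped strings, one per line.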
A couple of quick tips to make searching through this easier: wrap the text to a
desired column width (with a formatting tool like fold or fmt) to make it easier to
read. You can also invoke grep with the -n flag to print the line number of each
match, so you can jump to it and read around it for more context. Or you can even use the
-C flag with a number, which will print out that many surrounding lines
of context for you.
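Putting those flags together on a term from these emails, for example (TISA is just a sample search string here):
$ grep -n -C 2 "TISA" tisa_ocr.txt
That prints every match with its line number, plus two lines of context above and below.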
EDIT [2019-08-20]: An earlier version of this post mistakenly suggested that
you would need to use
pdftotext after using tesseract.