Ben Hancock

Computational Journalism, Python, and Linux

Using grep to Search Through Text


For those of you familiar with the command line, grep is likely an everyday tool. This post is meant as an introduction to the utility for those who aren't already command-line wizards. Especially for journalists, finding items of interest in piles of documents is an important part of the job, and it's good to know that you don't just have to rely on Ctrl + F -- or even a third-party tool like Evernote -- in order to mine insights from large volumes of files.

Before I get too far on this topic, I should mention that there is more than one version of grep, and the syntax is not always the same. Most Linux systems will probably come with a relatively recent version of GNU grep, whereas macOS comes with BSD grep. I've tried to ensure that the commands given in this post work under both -- or at least note when they don't -- but it's still important to know there is a difference. If you find yourself running into problems, that difference may be the cause.

grep stands for "globally search a regular expression and print" [Wikipedia]. In simple terms, it searches for patterns through plain text files and outputs the results to your terminal screen. I've referenced regular expressions (or "regex") in a previous post without getting into them; because they're a heady topic, I'll still only get into some basics here. But the ability to use them is one of the reasons that grep is so powerful.

The other big reason is that, like many command-line tools in the *nix world, grep lets you pipe its output to other commands. This enables you to transform and manipulate your output. Here is a great discussion on the power of pipes.
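
As a small preview of what piping looks like (the grep syntax itself is explained further down), here is a minimal sketch. The sample files and their contents are made up for illustration; the idea is just that grep's output becomes the input of wc, which counts lines.

```shell
# Scratch directory with hypothetical sample files; none of this is real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'foo\nfoo again\n' > 1.eml
printf 'bar\n' > 2.eml
printf 'one more foo\n' > 3.eml

# grep prints every line containing "foo"; wc -l then counts those lines.
# tr strips the padding that some versions of wc put around the number.
match_count=$(grep 'foo' *.eml | wc -l | tr -d ' ')
echo "$match_count"
```

Any command that reads standard input can take the place of wc here, which is what makes the combination so flexible.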

So what does using grep look like? Well, let's say you've got a batch of .eml files, a common extension for email files from clients like Outlook. (This was actually something that a reporter on the NICAR listserv confronted recently.) If you want to search through them, you could fire up Outlook yourself and use its search functionality. But that's kind of clunky, and since they're just plain text files, grep is a fast way to run through them. Let's imagine your working directory looks like this:

$ ls
1.eml  2.eml  3.eml

Let's say you want to find an email that contains the word "foo". I've created a few sample files here that contain just one string each ("foo", "bar", and "baz"). Your command might look like this:

$ grep -R 'foo' .
./1.eml:foo

As with most commands, if you want to learn more about the syntax and how it works, you can invoke the man command, for manual, ahead of it (e.g. man grep). If you do that, you'll learn that the -R flag in the above command means that the search should be done recursively through every file in the directory and its subdirectories. So if there were another folder within that directory that had a file containing the word "foo", that would be caught too.

$ grep -R 'foo' .
./this-is-a-subdir/4.eml:foo this
./1.eml:foo

The syntax above is one of those times where GNU grep and BSD grep differ (HT to John Hawkinson for pointing this out to me). Whereas with GNU grep you can leave off the trailing dot (.) and have it still work, that's not the case on macOS systems using BSD grep.

You can be more specific in your search though. Let's say you happened to be in a directory that isn't as clean, and maybe contains a bunch of other types of files that you don't want to search. You could limit the files that grep runs through to those ending with .eml using the wildcard character *. So this would look like:

$ grep 'foo' *.eml
1.eml:foo

Here, we've left off the -R flag, so we won't catch files in any subdirectories. And again, we'll only search files ending with .eml. If you're finding that the filename isn't included in the output (which is the default when grep is given only a single file to search), you can also pass the -H flag to ensure that it prints out as well.
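
If you want both behaviors at once, recursion and a filter on the file extension, one option is the --include flag. This is a sketch under a few assumptions: the directory layout below is invented for the example, and --include is not part of POSIX, so check man grep on your system (GNU grep supports it, and the BSD grep man page lists it as well).

```shell
# Hypothetical layout: two .eml files (one in a subdirectory) plus a stray .txt file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir this-is-a-subdir
printf 'foo\n' > 1.eml
printf 'foo in a text file\n' > notes.txt
printf 'foo this\n' > this-is-a-subdir/4.eml

# -R recurses into subdirectories; --include limits the search to filenames
# matching the glob. The glob is quoted so the shell hands it to grep as-is
# instead of expanding it first. sort just makes the output order predictable.
matches=$(grep -R --include='*.eml' 'foo' . | sort)
echo "$matches"
```

The notes.txt file contains "foo" too, but it is skipped because its name doesn't match the glob.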

With grep, we can also search for multiple terms in a couple of different ways. One is to use the regex alternation operator, which is a pipe (|) between the various terms. This falls under the category of extended regex patterns, and so in order to make this work, you'll need to pass the -E flag:

$ grep -E 'foo|bar' *.eml
1.eml:foo
2.eml:bar
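
The same search can also be written without extended regex by repeating the -e flag, once per pattern. A minimal sketch, recreating the three sample files in a scratch directory so it can be run on its own:

```shell
# Recreate the sample files from this post in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'foo\n' > 1.eml
printf 'bar\n' > 2.eml
printf 'baz\n' > 3.eml

# Each -e adds one pattern; a line is printed if it matches any of them.
matches=$(grep -e 'foo' -e 'bar' *.eml)
echo "$matches"
```

This form can be easier to read when the terms themselves contain characters that would otherwise need escaping inside a single alternation.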

The other way is to create a file with all of the terms you want to search for, each on its own line. This is super powerful stuff; imagine, for instance, that you want to see if any file in a directory mentions any name among a roster of individuals. So imagine that our text file, searchterms.txt, looks like this:

foo
bar
baz

We could then run a search with grep like this:

$ grep -f searchterms.txt *.eml
1.eml:foo
2.eml:bar
3.eml:baz
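
For the roster scenario, a few extra flags can help. The sketch below assumes a made-up roster and made-up emails (the names are placeholders, not real people). It combines -f with -F so each line is treated as a literal string, -i to ignore case, and -l to list only the files that match.

```shell
# Hypothetical emails and roster, created in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Lunch with Jane Doe on Friday\n' > 1.eml
printf 'Quarterly budget attached\n' > 2.eml
printf 'cc: j. smith (finance)\n' > 3.eml
printf 'Jane Doe\nJ. Smith\n' > roster.txt

# -F: treat each roster line as a fixed string, so the "." in "J. Smith"
#     is a literal period rather than a regex wildcard.
# -i: ignore case. -l: print only the names of files with a match.
files=$(grep -F -i -l -f roster.txt *.eml)
echo "$files"
```

One thing to watch for: an empty line in the pattern file matches every line, so it's worth making sure the roster has no blank lines.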

These are just simple examples, but hopefully they give you a flavor of the kinds of things you can do with grep. Once you start using regular expressions, you can do even more interesting things. Say, for example, that you wanted to find emails containing a word that starts with "b" and is followed by two other letters. You could do:

$ grep -E 'b\w{2}' *.eml
2.eml:bar
3.eml:baz

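In this example, "b" is interpreted literally, while the metacharacter pattern \w{2} means two "word" characters: letters, digits, or the underscore (not just any non-whitespace character). Note that the pattern can match anywhere in a line, so a "b" in the middle of a longer word counts too. I'll cover regexes in more depth in another post.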
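
To see the difference between matching anywhere and matching whole words, here is a small sketch with invented file contents. It swaps \w for the POSIX class [[:alpha:]] (letters only), adds -o to print just the matched text, and uses -w to require that a match be a whole word. Both -o and -w are available in GNU and BSD grep as far as I know, but they go beyond POSIX, so treat this as an illustration rather than a guarantee.

```shell
# Hypothetical sample files with some longer words containing "b".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'the bar was busy\n' > 2.eml
printf 'baz and a rabbit\n' > 3.eml

# Without -w, the pattern matches inside longer words as well:
# "bus" from "busy" and "bbi" from "rabbit" are both reported.
loose=$(grep -o -E 'b[[:alpha:]]{2}' *.eml)
echo "$loose"

# With -w, only matches that form a complete word are kept.
strict=$(grep -o -w -E 'b[[:alpha:]]{2}' *.eml)
echo "$strict"
```

The strict version is closer to the plain-English description of "a three-letter word starting with b".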

Happy grep-ing!