Diving into diff
The diff command compares two files and produces a list of the differences between the two. To be more accurate, it produces a list of the changes that would need to be made to the first file, to make it match the second file. If you keep that in mind you’ll find it easier to understand the output from diff. The diff command was designed to find differences between source code files and to produce an output that could be read and acted upon by other programs, such as the patch command. In this tutorial, we’re going to look at the most useful human-friendly ways to use diff.
Let’s dive right in and analyze two files. The order of the files on the command line determines which file diff considers to be the “first file” and which it considers to be the “second file.” In the example below, alpha1 is the first file and alpha2 is the second file. Both files contain the phonetic alphabet, but the second file, alpha2, has had some further editing so that the two files are not identical.
We can compare the files with this command. Type diff, a space, the name of the first file, a space, the name of the second file, and then press Enter.
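The example files themselves are not reproduced in the text, so here is a minimal sketch that rebuilds them from the description and runs the comparison. The exact spellings of the alphabet words and the use of a scratch directory are assumptions made for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# alpha1: the phonetic alphabet, one word per line (spellings assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1

# alpha2: an edited copy with two changed words, one deleted line,
# and three extra lines appended.
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# Compare the two files. diff exits with status 1 when the files differ,
# so report the status rather than treating it as a failure.
diff alpha1 alpha2 || echo "diff exit status: $?"
```

An exit status of 0 means no differences were found, 1 means differences were found, and 2 means diff ran into trouble.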
How do we dissect that output? Once you know what to look for, it’s not that bad. Each difference is listed in turn in a single column, and each difference is labeled. The label contains numbers on either side of a letter, like 4c4. The first number is the line number in alpha1, and the second number is the line number in alpha2. The letter in the middle can be:
c: The line in the first file needs to be changed to match the line in the second file.
d: The line in the first file must be deleted to match the second file.
a: Extra content must be added to the first file to make it match the second file.
The 4c4 in our example tells us that line four of alpha1 must be changed to match line four of alpha2. This is the first difference between the two files that diff found.
Lines that begin with < refer to the first file, in our example alpha1, and lines that start with > refer to the second file, alpha2. The line < Delta tells us that the word Delta is the content of line four in alpha1. The line > Dave tells us that the word Dave is the content of line four in alpha2. To summarise then, we need to replace Delta with Dave on line four in alpha1, to make that line match in both files.
The next change is indicated by the 12c12. Applying the same logic, this tells us that line 12 in alpha1 contains the word Lima, but line 12 of alpha2 contains the word Linux.
The third change refers to a line that has been deleted from alpha2. The label 21d20 is deciphered as “line 21 needs to be deleted from the first file to make both files synchronize from line 20 onwards.” The < Uniform line shows us the content of the line which needs to be deleted from alpha1.
The fourth difference is labeled 26a26,28. This change refers to three extra lines that have been added to alpha2. Note the 26,28 in the label. Two line numbers separated by a comma represent a range of line numbers. In this example, the range is from line 26 to line 28. The label is interpreted as “after line 26 in the first file, add lines 26 to 28 from the second file.” We are shown the three lines in alpha2 that need to be added to alpha1. These contain the words Quirk, Strange, and Charm.
Snappy One-Liners
If all you want to know is whether two files are the same, use the -s (report identical files) option.
You can use the -q (brief) option to get an equally terse statement about two files being different.
One thing to watch out for is that with two identical files the -q (brief) option completely clams up and doesn’t report anything at all.
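The three cases above can be sketched as follows. The file alpha3 is a hypothetical identical copy of alpha1, introduced here only so there is a matching pair to test; the file contents are rebuilt from the article's description, and the message wording is assumed to be that of GNU diff.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# alpha3: a hypothetical identical copy of alpha1.
cp alpha1 alpha3

# -s reports when two files are identical.
diff -s alpha1 alpha3

# -q gives a one-line report when two files differ (exit status 1).
diff -q alpha1 alpha2 || true

# -q prints nothing at all when the files are identical.
diff -q alpha1 alpha3
```

Combining the two options, as in diff -s -q, should give a one-line answer in both the identical and the different case.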
An Alternative View
The -y (side by side) option uses a different layout to describe the file differences. It is often convenient to use the -W (width) option with the side by side view, to limit the number of columns that are displayed. This avoids ugly wrap-around lines that make the output difficult to read. Here we have told diff to produce a side by side display and to limit the output to 70 columns.
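A minimal sketch of that side by side command, with the example files rebuilt from the article's description (the alphabet spellings are assumptions):

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# -y selects the side by side layout; -W 70 limits the output to 70 columns.
diff -y -W 70 alpha1 alpha2 || echo "diff exit status: $?"
```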
The first file on the command line, alpha1, is shown on the left and the second file on the command line, alpha2, is shown on the right. The lines from each file are displayed, side by side. There are indicator characters alongside those lines in alpha2 that have been changed, deleted or added.
|: A line that has been changed in the second file.
<: A line that has been deleted from the second file.
>: A line that has been added to the second file that is not in the first file.
If you’d prefer a more compact side by side summary of the file differences, use the --suppress-common-lines option. This forces diff to list the changed, added or deleted lines only.
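As a sketch, the compact listing could be produced like this. The files are rebuilt from the article's description (alphabet spellings assumed), and the 70-column width is carried over from the earlier side by side example.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# Side by side output with the lines common to both files left out.
diff -y -W 70 --suppress-common-lines alpha1 alpha2 \
  || echo "diff exit status: $?"
```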
Add a Splash of Color
Another utility called colordiff adds color highlighting to the diff output. This makes it much easier to see which lines have differences.
Use apt-get to install this package onto your system if you’re using Ubuntu or another Debian-based distribution. On other Linux distributions, use your Linux distribution’s package management tool instead.
Use colordiff just as you would use diff.
In fact, colordiff is a wrapper for diff, and diff does all the work behind the scenes. Because of that, all of the diff options will work with colordiff.
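The sketch below shows one way to use colordiff when it is available and fall back to plain diff when it is not. The variable name differ is hypothetical, and the example files are rebuilt from the article's description (alphabet spellings assumed).

```shell
# On Ubuntu or another Debian-based distribution, the package can be
# installed with:
#   sudo apt-get install colordiff

# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# Use colordiff if it is installed, otherwise fall back to plain diff.
differ=diff
if command -v colordiff >/dev/null 2>&1; then
  differ=colordiff
fi

# The arguments are the same whichever program is chosen.
"$differ" alpha1 alpha2 || echo "$differ exit status: $?"
```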
Providing Some Context
To find some middle ground between having all of the lines in the files displayed on the screen and having only the changed lines listed, we can ask diff to provide some context. There are two ways to do this. Both ways achieve the same purpose, which is to show some lines before and after each changed line. You’ll be able to see what’s going on in the file at the place where the difference was detected.
The first method uses the -c (copied context) option.
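A minimal sketch of the copied context command, with the example files rebuilt from the article's description (alphabet spellings assumed):

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# -c prints each difference with three lines of copied context around it.
diff -c alpha1 alpha2 || echo "diff exit status: $?"
```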
The diff output has a header. The header lists the two file names and their modification times. There are asterisks (*) before the name of the first file and dashes (-) before the name of the second file. Asterisks and dashes will be used to indicate which file the lines in the output belong to.
A line of asterisks with 1,7 in the middle indicates we’re looking at lines from alpha1. To be precise, we’re looking at lines one to seven. The word Delta is flagged as changed. It has an exclamation point (!) alongside it and, in colorized output, it is red. There are three lines of unchanged text displayed before and after that line so we can see the context of that line in the file.
The line of dashes with 1,7 in the middle tells us we’re now looking at lines from alpha2. Again, we’re looking at lines one to seven, with the word Dave on line four flagged as being different.
Three lines of context above and below each change is the default value. You can specify how many lines of context you want diff to provide. To do this, use the -C (copied context) option with a capital “C” and provide the number of lines you’d like:
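For instance, this sketch asks for two lines of copied context instead of three. The value 2 is an arbitrary choice for illustration, and the example files are rebuilt from the article's description (alphabet spellings assumed).

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# -C 2 shows two lines of copied context either side of each change.
diff -C 2 alpha1 alpha2 || echo "diff exit status: $?"
```

With two lines of context, the first block of output should cover lines two to six rather than one to seven.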
The second diff option that offers context is the -u (unified context) option.
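A minimal sketch of the unified context command, with the example files rebuilt from the article's description (alphabet spellings assumed):

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# -u prints each difference with three lines of unified context around it.
diff -u alpha1 alpha2 || echo "diff exit status: $?"
```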
As before, we have a header on the output. The two files are named, and their modification times are shown. There are dashes (-) before the name of alpha1 and plus signs (+) before the name of alpha2. This tells us that dashes will be used to refer to alpha1 and plus signs will be used to refer to alpha2. Scattered throughout the listing are lines that start with at signs (@). These lines mark the start of each difference. They also tell us which lines are being shown from each file.
We are shown the three lines before and after the line flagged as being different so that we can see the context of the changed line. In the unified view, the lines with the difference are shown one above the other. The line from alpha1 is preceded by a dash and the line from alpha2 is preceded by a plus sign. This display achieves in eight lines what the copied context display above took fifteen to do.
As you’d expect, we can ask diff to provide exactly the number of lines of unified context we’d like to see. To do this, use the -U (unified context) option with a capital “U” and provide the number of lines you want:
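For instance, this sketch asks for two lines of unified context. The value 2 is an arbitrary choice for illustration, and the example files are rebuilt from the article's description (alphabet spellings assumed).

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Rebuild the example files (alphabet spellings are assumed).
printf '%s\n' Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India \
  Juliett Kilo Lima Mike November Oscar Papa Quebec Romeo Sierra Tango \
  Uniform Victor Whiskey X-ray Yankee Zulu > alpha1
sed -e 's/^Delta$/Dave/' -e 's/^Lima$/Linux/' -e '/^Uniform$/d' alpha1 > alpha2
printf '%s\n' Quirk Strange Charm >> alpha2

# -U 2 shows two lines of unified context either side of each change.
diff -U 2 alpha1 alpha2 || echo "diff exit status: $?"
```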
Ignoring White Space and Case
Let’s analyze another two files, test4 and test5. These have the names of six superheroes in them.
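The contents of test4 and test5 are not reproduced in the text, so this sketch builds two stand-in files. The order of the names and the exact placement of the stray white space are assumptions chosen to match the differences the article describes.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# test4: six superhero names, one per line (order assumed).
printf '%s\n' 'Black Widow' 'Captain America' 'Ironman' \
  'Spider-Man' 'The Hulk' 'Thor' > test4

# test5: the same names with three small edits: an extra space inside
# "Captain America", a trailing space after "Ironman", and a lowercase
# "h" in "The hulk".
printf '%s\n' 'Black Widow' 'Captain  America' 'Ironman ' \
  'Spider-Man' 'The hulk' 'Thor' > test5

# Compare the two files with no options.
diff test4 test5 || echo "diff exit status: $?"
```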
The results show that diff finds nothing different with the Black Widow, Spider-Man and Thor lines. It does flag up changes with the Captain America, Ironman, and The Hulk lines.
So what’s different? Well, in test5 Hulk is spelled with a lowercase “h,” and Captain America has an extra space between “Captain” and “America.” OK, that’s plain to see, but what’s wrong with the Ironman line? There are no visible differences. Here’s a good rule of thumb. If you can’t see it, the answer is white space. There’s almost certainly a stray space or two, or a tab character, at the end of that line.
If differences like these don’t matter to you, you can instruct diff to ignore specific types of line difference, including:
-i: Ignore differences in case.
-Z: Ignore trailing white space.
-b: Ignore changes in the amount of white space.
-w: Ignore all white space changes.
Let’s ask diff to check those two files again, but this time to ignore any differences in case.
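A sketch of the case-insensitive comparison, using stand-in files whose exact contents are assumptions built to match the article's description:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in files (name order and white space placement are assumed).
printf '%s\n' 'Black Widow' 'Captain America' 'Ironman' \
  'Spider-Man' 'The Hulk' 'Thor' > test4
printf '%s\n' 'Black Widow' 'Captain  America' 'Ironman ' \
  'Spider-Man' 'The hulk' 'Thor' > test5

# -i ignores differences in case, so "The Hulk" and "The hulk" match.
diff -i test4 test5 || echo "diff exit status: $?"
```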
The lines with “The Hulk” and “The hulk” are now considered a match, and no difference is flagged for lowercase “h.” Let’s ask diff to also ignore trailing white space.
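A sketch that adds the trailing white space option to the previous command. The stand-in file contents are assumptions built to match the article's description, and the -Z option is assumed to be available, as it is in GNU diff.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in files (name order and white space placement are assumed).
printf '%s\n' 'Black Widow' 'Captain America' 'Ironman' \
  'Spider-Man' 'The Hulk' 'Thor' > test4
printf '%s\n' 'Black Widow' 'Captain  America' 'Ironman ' \
  'Spider-Man' 'The hulk' 'Thor' > test5

# -i ignores case; -Z ignores white space at the end of a line.
diff -i -Z test4 test5 || echo "diff exit status: $?"
```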
As suspected, trailing white space must have been the difference on the Ironman line because diff no longer flags a difference for that line. That leaves Captain America. Let’s ask diff to ignore case and to ignore all white space issues.
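A sketch of that final comparison, again using stand-in files whose exact contents are assumptions built to match the article's description:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Stand-in files (name order and white space placement are assumed).
printf '%s\n' 'Black Widow' 'Captain America' 'Ironman' \
  'Spider-Man' 'The Hulk' 'Thor' > test4
printf '%s\n' 'Black Widow' 'Captain  America' 'Ironman ' \
  'Spider-Man' 'The hulk' 'Thor' > test5

# -i ignores case; -w ignores all white space when comparing lines.
# With both in place diff should print nothing and exit with status 0.
diff -i -w test4 test5 && echo "No differences reported"
```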
By telling diff to ignore the differences that we’re not concerned about, diff tells us that, for our purposes, the files match.
The diff command has many more options, but the majority of them relate to producing machine-readable output. These can be reviewed on the Linux man page. The options we’ve used in the examples above will enable you to track down all the differences between versions of your text files, using the command line and human eyeballs.