How a Diff Checker Works: LCS, Hunks, and Diff Levels

A developer's guide to text and code diffing — the Longest Common Subsequence algorithm, line versus word versus character granularity, ignore-whitespace and ignore-case normalisation, and how to read a side-by-side diff.

23 June 2026 | 4 min read | By Tools.Town Team | Fact Checked

Key Takeaways

  • Most diff tools, including ours, build on the Longest Common Subsequence (LCS) algorithm.
  • Diff by line for code and config, by word for prose, and by character for short strings where one character matters.
  • Reformatting, re-indentation, and trailing-space cleanup can rewrite hundreds of lines without changing behaviour; Ignore Whitespace filters that noise out.

The problem a diff solves

Two versions of the same file sit in front of you and the question is always the same: what changed? For a one-line edit your eyes can answer it. For anything larger — a refactored function, a reworked paragraph, two config files that drifted apart — manual comparison is slow and unreliable. You skim past a flipped sign, a renamed variable, a duplicated line, and the bug ships. A diff checker turns that fuzzy human task into an exact, colour-coded answer: these lines were added, these were removed, the rest are unchanged. The Diff Checker tool does this in your browser, with line, word, and character precision.

What “difference” actually means

It is tempting to think a diff just compares line 1 to line 1, line 2 to line 2, and so on. That naive approach falls apart the moment a line is inserted near the top: every line below it shifts by one, and a positional comparison reports the entire rest of the file as changed. A good diff is alignment-aware. It understands that if you insert a line, everything below it is still the same content, just shifted down.
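
To make the failure concrete, here is a minimal sketch of the naive positional comparison. The function name `positional_diff` and the sample lines are hypothetical, chosen only for illustration; this is not how the Diff Checker compares files.

```python
from itertools import zip_longest


def positional_diff(a_lines, b_lines):
    """Naive comparison: report every index where the two files disagree."""
    changed = []
    for index, (left, right) in enumerate(zip_longest(a_lines, b_lines)):
        if left != right:
            changed.append(index)
    return changed


old = ["alpha", "beta", "gamma", "delta"]
new = ["intro", "alpha", "beta", "gamma", "delta"]  # one line inserted at the top

# Every position is reported, even though only one line was really added.
print(positional_diff(old, new))
```

A single insertion at the top makes every index disagree, which is exactly the noise an alignment-aware diff avoids.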

The standard way to compute this is the Longest Common Subsequence, or LCS. Given two sequences of tokens — say, the lines of file A and file B — the LCS is the longest ordered list of tokens that appears in both, not necessarily contiguously. Once you know the LCS, the diff writes itself: any token in A that is not part of the common subsequence was removed, any token in B that is not part of it was added, and the shared tokens are unchanged.

How the LCS algorithm works

The classic LCS is a dynamic-programming routine. You build a table where cell (i, j) holds the length of the longest common subsequence of the first i tokens of A and the first j tokens of B; row 0 and column 0 are all zeros, because an empty prefix shares nothing with the other side. The rule for filling the rest is short: if token A[i] equals token B[j], the cell is one more than the diagonal neighbour (i-1, j-1); otherwise it is the larger of the neighbour above and the neighbour to the left. After the table is full, you walk backwards from the bottom-right corner, emitting an “equal” step on a diagonal match, a “removed” step when you move up, and an “added” step when you move left. Reverse the trail and you have an ordered list of edit steps; consecutive steps of the same kind group into the hunks a diff view displays.
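
The table-and-backtrack procedure described above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the tool's actual source: the function name `lcs_diff`, the `(op, token)` output shape, and the tie-breaking rule (prefer moving left on a tie, so that a removal is listed before its matching addition) are all choices made for this example.

```python
def lcs_diff(a, b):
    """Return a list of (op, token) pairs; op is 'equal', 'removed', or 'added'."""
    n, m = len(a), len(b)

    # table[i][j] = LCS length of the first i tokens of a and the first j of b.
    # Row 0 and column 0 stay at zero (empty prefixes share nothing).
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner, collecting steps in reverse.
    steps = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            steps.append(("equal", a[i - 1]))  # diagonal move
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or table[i][j - 1] >= table[i - 1][j]):
            steps.append(("added", b[j - 1]))  # move left
            j -= 1
        else:
            steps.append(("removed", a[i - 1]))  # move up
            i -= 1

    steps.reverse()
    return steps


for op, token in lcs_diff(["a", "b", "c"], ["a", "x", "c"]):
    print(op, token)
```

The code uses 0-based list indices, so `a[i - 1]` corresponds to token A[i] in the prose.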

This is O(n × m) in time and memory, where n and m are the token counts. That is fine for the inputs a browser tool typically handles, on the order of a few thousand lines per side. Industrial tools like Git default to Myers’ diff, whose running time is O(N × D), where N is the combined length of the two inputs and D is the number of differences, so it is close to linear on similar files. The output is the same kind of add/remove/equal step list, so understanding the LCS version is enough to reason about most diffs you will read.

Line, word, and character granularity

The single most useful knob on a diff checker is the granularity — what counts as one token.

  • Line diff treats each line as a token. This is the default for code and configuration, because a line is the natural unit of a program. A changed line shows up as one removal plus one addition.
  • Word diff splits on whitespace, so each word is a token. This shines for prose: change “the quick brown fox” to “the slow brown fox” and a line diff flags the whole line, but a word diff isolates exactly quick → slow and leaves the rest untouched.
  • Character diff treats every character as a token. Use it for short strings, identifiers, or a single line where one character is the whole story — a flipped operator, a typo in a constant, a missing semicolon.
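
The three levels differ only in how the input is split into tokens before the comparison runs. A minimal sketch, assuming a hypothetical `tokenize` helper; note that `str.split()` discards the whitespace between words, whereas a real tool may keep it as separate tokens so the original text can be re-rendered exactly.

```python
def tokenize(text, level):
    """Split text into tokens for a 'line', 'word', or 'char' diff."""
    if level == "line":
        return text.splitlines()
    if level == "word":
        return text.split()  # splits on any run of whitespace
    if level == "char":
        return list(text)
    raise ValueError(f"unknown level: {level!r}")


old = "the quick brown fox"
new = "the slow brown fox"

for level in ("line", "word", "char"):
    print(level, tokenize(old, level), tokenize(new, level))
```

At line level each sentence is a single token, so the whole line differs; at word level only one of the four tokens differs.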

Choosing the right level is the difference between a diff that points at the change and one that buries it. The Diff Checker tool lets you switch levels instantly and recomputes the stats for each.

Normalisation: ignore whitespace and ignore case

Raw diffs are often too noisy to be useful, and the cause is almost always cosmetic. Two situations come up constantly:

Whitespace churn. A formatter re-indents a file, someone converts tabs to spaces, or an editor strips trailing whitespace on save. None of it changes behaviour, but a literal diff reports every touched line as changed. Ignore Whitespace fixes this by normalising each token — collapsing runs of spaces to one and trimming the ends — before comparison. The original text is still displayed exactly; only the comparison is relaxed. In a real review this can shrink a 300-line diff down to the three lines that actually matter.

Case differences. Comparing data that was lower-cased in one pipeline and not the other, or checking whether two strings match apart from capitalisation, produces spurious diffs. Ignore Case treats upper and lower case as equal for comparison purposes.

The important design principle is that normalisation affects comparison only, never the rendered or copied output. You always see your real text; you just control how strict the matching is.
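
One way to honour that principle is to derive a separate comparison key from each token and leave the token itself untouched. The sketch below is an assumption about how such a feature could be built, not the tool's actual code: `normalise` and `tokens_match` are hypothetical names, the whitespace rule here collapses tabs as well as spaces, and `casefold()` is one reasonable choice for case-insensitive matching.

```python
def normalise(token, ignore_whitespace=False, ignore_case=False):
    """Return the key used for comparison; the original token is not modified."""
    key = token
    if ignore_whitespace:
        # Collapse runs of whitespace to one space and trim both ends.
        key = " ".join(key.split())
    if ignore_case:
        key = key.casefold()
    return key


def tokens_match(left, right, **options):
    """Compare two tokens under the chosen normalisation options."""
    return normalise(left, **options) == normalise(right, **options)


spaces = "    return  total"
tabs = "\treturn total   "

print(tokens_match(spaces, tabs))                          # strict comparison
print(tokens_match(spaces, tabs, ignore_whitespace=True))  # relaxed comparison
print(tokens_match("Hello", "hello", ignore_case=True))
```

Because only the keys are compared, the rendering step can still show `spaces` and `tabs` exactly as they were typed.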

Reading a side-by-side diff

A side-by-side, or split, view places the original on the left and the new version on the right, row-aligned. The alignment is what makes it readable: a removed line on the left sits opposite a blank on the right, and an added line on the right sits opposite a blank on the left, so unchanged content stays level across both panels. Green marks additions, red marks removals, and plain rows are unchanged. Line numbers on each side let you jump straight to the change in your editor.
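
The row alignment can be sketched as a small transformation over a list of diff steps. Everything here is illustrative: the `(op, text)` step format, the `side_by_side` name, and the row tuple layout are assumptions made for the example, and an "equal" row shows the same text on both sides, which would need refining if normalisation let two different strings match.

```python
def side_by_side(steps):
    """Turn (op, text) steps into rows of (left_no, left_text, right_no, right_text)."""
    rows = []
    left_no = right_no = 0
    for op, text in steps:
        if op == "equal":
            left_no += 1
            right_no += 1
            rows.append((left_no, text, right_no, text))
        elif op == "removed":
            left_no += 1
            rows.append((left_no, text, None, ""))  # blank on the right
        elif op == "added":
            right_no += 1
            rows.append((None, "", right_no, text))  # blank on the left
        else:
            raise ValueError(f"unknown op: {op!r}")
    return rows


steps = [("equal", "a"), ("removed", "b"), ("added", "x"), ("equal", "c")]
rows = side_by_side(steps)

for left_no, left, right_no, right in rows:
    print(f"{left_no or '':>3} {left:<10} | {right_no or '':>3} {right}")
```

Each side keeps its own line counter, so the numbers still match what you would see in an editor even when blank filler rows are inserted.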

Alongside the panels, a stats row answers the “how much changed?” question at a glance: counts of added, removed, and unchanged tokens, plus a similarity score. That score is simply the unchanged-token count divided by the larger of the two token counts, shown as a percentage — 100% means identical, 0% means nothing in common. It is a quick gut-check before you dive into the detail.
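
The stats row follows directly from the step list. A minimal sketch of the calculation described above, assuming the same hypothetical `(op, text)` step format; treating two empty inputs as 100% similar is a choice made here to avoid dividing by zero, not something the article specifies.

```python
from collections import Counter


def diff_stats(steps):
    """Count added/removed/unchanged tokens and compute the similarity score."""
    counts = Counter(op for op, _ in steps)
    equal, removed, added = counts["equal"], counts["removed"], counts["added"]

    old_total = equal + removed  # tokens in the original
    new_total = equal + added    # tokens in the new version
    larger = max(old_total, new_total)

    # Assumption: two empty inputs count as identical.
    similarity = 100.0 if larger == 0 else 100.0 * equal / larger
    return {
        "added": added,
        "removed": removed,
        "unchanged": equal,
        "similarity": similarity,
    }


steps = [("equal", "a"), ("removed", "b"), ("added", "x"), ("equal", "c")]
stats = diff_stats(steps)
print(stats)
```

With two of three tokens unchanged on each side, the score for this example is two thirds, shown as roughly 66.7%.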

Where diffs fit in your workflow

Diffing is everywhere once you start noticing it. Code review is the obvious case — every pull request is a diff, and reading it well is a core engineering skill. But the same tool compares two config files to find environment drift, checks a generated file against a committed one in CI, compares expected versus actual test output to localise a regression, or tracks edits between two drafts of a document. If you are also cleaning up the structure of the files before comparing — say, normalising JSON so a diff is meaningful — pair this with our JSON formatter guide, which explains why consistent formatting makes diffs dramatically clearer.
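
As a small illustration of that last point, the sketch below canonicalises two JSON documents with Python's standard `json` module before they are compared. The helper name `canonical_json` and the sample documents are hypothetical; sorted keys with two-space indentation is just one reasonable canonical form.

```python
import json


def canonical_json(text):
    """Re-serialise JSON with sorted keys and fixed indentation."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True)


staging = '{"port": 8080, "debug": true}'
production = '{ "debug": false,\n  "port": 8080 }'

print(canonical_json(staging))
print(canonical_json(production))
```

After canonicalisation the two files differ on a single line (the `debug` value), so a line diff points straight at the drift instead of at key order and spacing.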

Putting it together

A diff checker is a small idea executed precisely: tokenise both inputs at the granularity you care about, find the longest common subsequence, classify the rest as added or removed, and render it so a human can scan it in seconds. Add whitespace and case normalisation to strip the noise, and you have a tool that turns “what changed?” from a chore into a glance. Try it on your next pair of files with the Diff Checker and switch between line, word, and character modes to feel how each one reframes the same change.

Frequently Asked Questions

What algorithm does a diff checker use?
Most diff tools, including ours, build on the Longest Common Subsequence (LCS). It finds the longest ordered sequence of tokens common to both versions, and everything outside that sequence is classified as added or removed. Production version-control systems often use Myers' refinement for speed on large files, but the result is conceptually the same.

Should I diff by line, word, or character?
Diff by line for code and config, where each line is a meaningful unit. Diff by word for prose and documentation, so a single reworded phrase does not flag the whole paragraph. Diff by character for short strings, identifiers, or single lines where one character matters.

Why does Ignore Whitespace matter so much in code review?
Reformatting, re-indentation, and trailing-space cleanup can rewrite hundreds of lines without changing behaviour. Ignoring whitespace collapses that noise so you only review edits that actually alter the code, which is the difference between a five-minute review and an hour of scrolling.
