In the past couple of decades, automated testing of code has become commonplace, as has using tools to check which code has been covered by the tests.
This is far from perfect. Just because you exercised a line of code in one test does not mean that you tested it well or even correctly. Often, it takes several different tests of the same line of code to ensure that you’ve covered the most likely situations.
For example—and this is an absurdly simple example—if I have a divide function:
function divide(x: number, y: number): number {
  return x / y
}
And I write code that tests it like this:
import { strict as assert } from 'node:assert'
assert.equal(divide(6, 3), 2) // passes
Then my code coverage tool will report that the function has been tested.
But consider what happens if y is 0. We get Infinity out. Is that something we wanted? As you can see:
typeof Infinity // returns "number"
It’s a number, so we won’t catch it in a type check. Maybe it’s OK, maybe it isn’t. But if it’s not OK and we need a guard of some sort, then we need more tests.
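For instance, if we decide that dividing by zero should be an error rather than Infinity, the guarded version and the extra test it needs might look something like this (just a sketch of one possible guard; throwing a RangeError is an illustrative choice, not the only option):

import { strict as assert } from 'node:assert'

function divide(x: number, y: number): number {
  // Guard: dividing by zero would otherwise return Infinity (or NaN for 0 / 0).
  if (y === 0) {
    throw new RangeError('Cannot divide by zero')
  }
  return x / y
}

// The original happy-path test still passes...
assert.equal(divide(6, 3), 2)

// ...but the guard needs a test of its own.
assert.throws(() => divide(6, 0), RangeError)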
In short, just because we tested every line, every conditional, etc. does not mean that we’re completely covered.
But 100% is a start, right? Not to some…
The smart folks get it wrong
There are many smart developers who claim that 100% code coverage is too much. They say that we should only try for some lower number, maybe 80%? 70%?
They are wrong.
I understand why they say this, but they are making an error. What they mean is that some bits of code don’t need to be tested, and that trying to write tests for every line of code brings diminishing returns. That part is true.
I suspect that they base this on the Pareto Principle, namely that the last 20% of the work takes 80% of the effort. So if we cover the 80% that’s easy, then we’ve done it with only 20% of the effort it would take to get to 100%.
Frankly, I doubt that’s correct. But it is true that there are potentially diminishing returns as we approach 100% coverage.
I say “potentially” because it depends on the order in which you wrote your tests. If you started with the most difficult tests to write, then at 80% all you have left are the easiest ones.
So there is an assumption here that the 20% at the end are the ones that give the least bang for the buck—that you did the most important tests first, and that pushing further would only be for completion’s sake.
But there is a subtle but important distinction here that the smart devs are failing to make.
Measured coverage is not actual coverage
As I already explained above, having 100% measured coverage doesn’t mean that you’re covered for every possible event.
But there is another way that measured coverage can diverge from actual coverage: we can tell the coverage tool to ignore specific lines of code (or even whole files).
This gives the appearance of more coverage than we actually have. I can comment out 20% of the lines of code, and hit 100% measured coverage while actually only being at 80%.
Insanity, right? Why the hell would I do that?
It’s the psychology, stupid
Well, let’s take a look at what happens when we set an arbitrary limit on the minimum code coverage, such as 80%. (A lot of libraries would be doing great to get even 70%.)
Obviously, that leaves 20% of our code uncovered. Gasp! And the assumption is that the missing tests don’t really matter. After all, why would we risk our business by randomly leaving out key tests? It would be like Russian roulette.
But how do we know which lines of code are uncovered? Sure, the code coverage report tells us. Mine says lines 2, 4, 17-19, and 45-53. Ah, ha! I know exactly what those lines do!
Ha, ha. No, no I don’t. I’d have to go look at those lines of code and spend some time thinking about them to figure out whether they were important or not. In a large code base with 80% code coverage, hundreds or thousands of lines will not be covered. Who’s gonna check all that? (Not me.)
So if we’re going to test only to 80%, how do we decide what not to test?
Here’s an idea: we should set out to test every line of code, and then when we get to a line of code and we see that it is not worth testing, we should remove it from code coverage with an ignore comment. For example:
/* istanbul ignore next */
Now the Turkish people will look away, er, I mean the Istanbul code coverage tool will ignore the following line.
One great benefit of using comments in the code to remove things from the measured coverage is that you can search for them and quickly find what’s covered. If you add a reason for the ignore instruction in the same comment, then code reviewers can quickly understand what you’re doing and why.
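Something like this, for example (the reason text is whatever helps the reviewer; as far as I know istanbul only matches the start of the hint, so trailing text is tolerated, but check your coverage tool’s docs if in doubt):

/* istanbul ignore next -- defensive fallback for a condition our input validation already prevents */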
Another benefit is that you know exactly which tests you’re removing and, hopefully, you only remove those that should be removed. It’s not arbitrary.
But there is yet another benefit.
Green means go, red means stop
Most people are binary thinkers, and developers are no exception. We like things all one way or all the other.
So the difference between 100% coverage in your coverage report (GREEN) and 99% coverage (RED) is the difference between 1 and 0, true and false, yes and NO.
You may tell yourself that this is not the case, but your brain doesn’t work that way. It sees green and 100% and it thinks “Score!” It sees red and any percentage less than 100% and it thinks “FAIL!”
As a result, if you never have 100% coverage, then you never see a green report and your brain stops caring. Meh. Fail, fail, fail. What a loser I am!
There is no psychological and emotional reward for failure. This is an enormous difference.
But there’s more.
We do need some stinkin’ badges
Another big motivator is winning badges. We loves us some badges. We put badges on all sorts of things. It’s a bit infantile, but it’s true. You know it is.
It’s triple-A accessible. The CSS passes the validator with no errors. It’s standards compliant. We’re using the latest version. And so much more. Ooo, look! We got an award!
The green code coverage report is a kind of badge. Every time you run the coverage tool and it comes up green and 100%, you get a flush of euphoria and a feeling of success. Admit it. You know you do.
Hell, get four 100s in Lighthouse and it displays tiny little fireworks. Even Google knows that you need that reward. And the encouragement.
So why would we deny that to ourselves? That’s just psychologically stupid. Emotionally stupid. Frankly, it’s code stupid. Did I mention that it’s stupid?
Make your measure 100%
I am not recommending that we sprinkle ignore comments around at random until we see 100%, just for that rush. That’s cheating. But cheaters are easy to catch, because we can all see who put that ignore instruction there (and they’d better have a good reason).
So the way I do it is I begin by writing all the tests I think I need. I focus on end-to-end tests, integration tests for most components (mocking only at the edges of the app), and a few unit tests for simple utility functions that are widely shared. And static types, of course.
When I think my coverage is pretty good, I run the code coverage tool, typically before I deploy code. I use the tool to track down the lines of code not covered.
If those lines need to be tested, then I write the tests no matter how troublesome. But if there is little point in testing them, I comment them out with an explanation.
I keep this up, running the coverage tool iteratively, until I see that lovely green 100%. Only then do I deploy.
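If you want the build itself to insist on that green 100%, most coverage tools can be told to fail below a threshold. Here is a sketch of how that might look, assuming c8 and Node’s built-in test runner (nyc and most other tools have equivalent check-coverage options):

npx c8 --check-coverage --lines 100 --branches 100 --functions 100 --statements 100 node --test

If anything dips below 100%, the command exits non-zero, which is enough to gate a deploy script or a CI job.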
Periodically, I inventory the ignore instructions and revisit them, asking if maybe I should have written a test.
And neither the actual test coverage percentage nor the lines of code untested are chosen arbitrarily, which, insanely, is what a lot of otherwise smart devs seem to be recommending. Just where did they get that arbitrary minimum number from? Is there any science to back it up? And is it really the same for every code base?
It’s not just code coverage
There are many situations where we need to have a simple pass/fail measure.
I recently posted an issue on a tool for testing accessibility on web pages. I highly recommend this tool, Deque’s axe DevTools. But I am less thrilled about how it reports “issues”.
Many aspects of accessibility cannot be automatically checked by the tool, but some can. After you run an audit of a page, the tool displays a number representing the issues it found.
As with 100% code coverage, there is a psychological reward when this number comes up 0.
The problem is that they’ve mixed issues that are clearly wrong in with others that might be OK but require manual testing. This often means that you can never get that number to 0 unless you engage in some clumsy tricks to work around the uncheckable issues.
But if an issue is uncheckable automatically, then why report it as an issue? Why not have a second column for things to check manually? That way the “issues” number can be simple pass/fail. It’s either 0 or it isn’t. And make it BIG. And green.
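For what it’s worth, the underlying axe-core engine already makes this split in its programmatic results: definite failures and “needs review” items come back in separate arrays. A rough sketch of reporting them separately (assuming axe-core is loaded as a module on the page; this is not how the DevTools extension itself presents results):

import axe from 'axe-core'

const results = await axe.run(document)

// The pass/fail number: only definite violations count toward it.
console.log(`Issues: ${results.violations.length}`)

// The second “column”: things the tool could not decide on its own.
console.log(`Check manually: ${results.incomplete.length}`)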
If you look around, you’ll see that we often fail to take into account our desire to be number 1, not number 3, or to get a perfect score, not an almost-perfect score, or to go for the gold, not the bronze.
Let’s face it, in the minds of most people there is a bigger gap between winning gold and winning silver than there is between winning silver and winning no medal at all. Silver is impressive, but we can’t help thinking “also ran”.
We’re not likely to change our psychology soon, so maybe we should adapt our testing procedures to our psychology, instead of working against it.
Or maybe it’s just me.