Silent Failures

The most expensive bug is not always the one that crashes. Crashes are rude, but at least they are honest: they interrupt you, they make noise, and they force somebody to look at whatever broke. The nastier bug is the one that fails and still reports success. Those are the ones I have lost the most time to, and almost none of them looked like failures in the moment they happened. Instead they looked like green builds, completed jobs, and finished reports, right up until I went looking for something and found it had quietly not been there for weeks.

A quick note before I get into it: I have spent the last few weeks writing about the big-picture stuff (AI governance, the ethics of pulling the ladder up behind us, the philosophical weather of all this), and I got a bit tired of it. So this week I wanted to come back down to the workshop floor and write about something concrete I actually deal with, hands on the tools. Silent failures are exactly that kind of problem.

This piece is mostly about that silent half, but it is worth saying up front that the silent failure has a louder twin, and they are two sides of one coin. At one extreme a system tells you nothing when something breaks. At the other it tells you everything, all the time, until the noise is so constant that you tune the whole channel out. They look like opposites, but they fail you the same way: you end up with a signal you do not act on. One you cannot see; the other you have trained yourself to ignore. The cure for both turns out to be the same discipline, and I will come back to it.

Four failures that all reported success

Let me start with real ones, because the abstract version of this argument is easy to nod along to and easy to forget. These are all mine, with the identifying details filed off, from systems I have built and now run with AI assistance.

The first was a coverage check. I had a CI step that measured test coverage on every change: it computed the number, uploaded it, produced an artefact, and went green. I read that green as "coverage is under control." It was not. The step measured coverage but never enforced a threshold, so code could land with no tests against it at all and the check still passed. It was running, but it was not checking. From the outside those two states are identical, which is the whole problem.
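The difference between measuring and enforcing is small enough to fit in a few lines. This is a minimal sketch, not the step I actually ran: the function name, the exit codes, and the idea that the measured percentage arrives as a plain number are all assumptions for illustration.

```python
import sys


def coverage_gate(measured_percent: float, minimum_percent: float) -> int:
    """Return a process exit code: 0 only when coverage meets the minimum.

    Hypothetical helper. The point is that the step can fail, not just report.
    """
    if not (0.0 <= measured_percent <= 100.0):
        # A nonsensical reading (including NaN) fails the gate instead of
        # being waved through as "some number was produced".
        print(f"coverage gate: invalid reading {measured_percent!r}", file=sys.stderr)
        return 2
    if measured_percent < minimum_percent:
        print(
            f"coverage gate: {measured_percent:.1f}% is below "
            f"the required {minimum_percent:.1f}%",
            file=sys.stderr,
        )
        return 1
    return 0
```

A CI step would pass the return value to `sys.exit`, so that a low or unreadable number turns the build red. If you use coverage.py, its `coverage report --fail-under=N` option does the same job without custom code.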

The second was worse, because it produced a confident wrong answer instead of a missing one. I had a function that checked eligibility against an external source, and when the lookup failed (a timeout, a refused connection) it returned false rather than raising or returning "unknown." Downstream, nothing could tell the difference between "I checked and the answer is no" and "I could not check." A network blip became a definite negative, and decisions got made on it. A crash there would have been a gift; instead the system manufactured a plausible lie and handed it onward with full confidence.
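One way to keep "no" and "could not check" apart is to give the second one its own value. The sketch below assumes a hypothetical `lookup` callable that returns a boolean and raises on network trouble; the names `Eligibility`, `check_eligibility`, and `may_proceed` are mine for illustration, not from the original system.

```python
from enum import Enum


class Eligibility(Enum):
    ELIGIBLE = "eligible"
    NOT_ELIGIBLE = "not_eligible"
    UNKNOWN = "unknown"  # the lookup itself failed; no answer was obtained


def check_eligibility(subject_id, lookup):
    """Ask the external source, keeping 'no' and 'could not check' distinct.

    `lookup` is a hypothetical callable: returns True/False, raises on I/O failure.
    """
    try:
        answer = lookup(subject_id)
    except (TimeoutError, ConnectionError):
        # A timeout or refused connection is not a negative answer.
        return Eligibility.UNKNOWN
    return Eligibility.ELIGIBLE if answer else Eligibility.NOT_ELIGIBLE


def may_proceed(result):
    """Downstream decision point: refuses to decide on an unknown."""
    if result is Eligibility.UNKNOWN:
        raise RuntimeError("eligibility could not be determined; refusing to decide")
    return result is Eligibility.ELIGIBLE
```

Raising straight from `check_eligibility` would work just as well. What matters is that a caller cannot mistake a network blip for a definite answer.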

The third was a report assembled from several queries, each wrapped in its own error handler that returned empty on failure. If one of those sources was down, the report still rendered. It just quietly came back missing a chunk of its content, with nothing on the page to say a section was absent rather than genuinely empty. It looked complete. Someone relying on it had no way to know they were reading a partial answer dressed as a whole one.
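The repair is to make "could not load" a different thing from "loaded and empty", and to say so on the page. A rough sketch, assuming each section comes from a hypothetical zero-argument `query` callable; `Section`, `load_section`, and `render` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    title: str
    rows: Optional[list]  # None = could not be loaded; [] = loaded, genuinely empty
    error: Optional[str] = None


def load_section(title, query):
    """Run one query; on failure keep the error instead of returning empty."""
    try:
        return Section(title, rows=list(query()))
    except Exception as exc:
        return Section(title, rows=None, error=f"{type(exc).__name__}: {exc}")


def render(sections):
    """Render plain text, flagging missing sections at the top and in place."""
    lines = []
    for section in sections:
        lines.append(section.title)
        if section.rows is None:
            lines.append(f"  SECTION UNAVAILABLE ({section.error})")
        elif not section.rows:
            lines.append("  (no rows)")
        else:
            lines.extend(f"  {row}" for row in section.rows)
    missing = [s.title for s in sections if s.rows is None]
    if missing:
        lines.insert(0, "PARTIAL REPORT: missing " + ", ".join(missing))
    return "\n".join(lines)
```

The report still renders when a source is down, which is often what you want, but the reader can now tell a partial answer from a whole one.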

The fourth is my favourite, in the way a particularly stupid scar is your favourite. I had an import job that, when it failed partway through, ran an error handler that reset its own progress counters to zero before logging. So the single number that would have told me how far it got (and therefore what state my data was now in, half-written) was destroyed by the exact code whose job was to handle the failure. The failure handler was itself a silent failure. That is the shape at its purest.
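What the handler should have done is carry the counters out with the error. This is a sketch under assumptions: `records` yields `(key, payload)` pairs, `write` is a hypothetical callable that raises on failure, and `Progress` and `ImportFailed` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    processed: int = 0
    last_key: object = None  # key of the last record successfully written


class ImportFailed(Exception):
    """Carries the progress reached at the moment of failure."""

    def __init__(self, progress, cause):
        super().__init__(
            f"import failed after {progress.processed} records "
            f"(last key written: {progress.last_key!r}): {cause!r}"
        )
        self.progress = progress


def run_import(records, write):
    """Write records in order; on failure, preserve how far the job got."""
    progress = Progress()
    for key, payload in records:
        try:
            write(key, payload)
        except Exception as exc:
            # Deliberately no reset here: the counters are the evidence.
            raise ImportFailed(progress, exc) from exc
        progress.processed += 1
        progress.last_key = key
    return progress
```

Whoever catches `ImportFailed` gets the record count and the last key written, which is exactly what you need to work out what state the half-written data is in.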

Four different bugs, four different mechanisms, yet all from one family:

A check that ran without checking.
An error coerced into a valid-looking value.
A partial result presented as a whole.
A failure handler that ate its own evidence.

None of them threw an actual error, and all of them reported success. Once you have been bitten by this failure shape a few times you start seeing it everywhere: the scheduled job that silently stopped running because nothing watches for its absence, the backup that "completes" nightly but has never once been restored from, the bug tracker with items marked resolved because a process step finished, not because the underlying code actually changed.

Why this costs more than a crash

The instinct is to rank bugs by blast radius: an outage in checkout is a five-alarm fire, a cosmetic glitch is nothing. That is about impact, and it is fine as far as it goes. It does, however, entirely miss a second axis that matters just as much: how long the bug gets to live before anyone knows it exists. Silent failures are catastrophic on that axis. A crash might be found in minutes. A silent failure, though, sits there for weeks, quietly creating fictional evidence, and by the time you find it you are no longer fixing one bug. You are now fixing the bug, the bad assumptions it seeded, all the work done under those assumptions, and your now-justified mistrust of the control that was supposed to catch it. When a feature breaks, you fix the feature. When a control breaks silently, you have to ask what else it waved through, which is a far uglier question to answer.

There's an old line about a defect costing a hundred times more to fix in production than in design. That exact multiplier gets repeated everywhere, usually with no recoverable primary source behind it [1], so I would not stake anything on the number. The direction is enough, and the direction is obviously true. A bug caught while you are writing the code is a small correction; the same bug caught after release is an investigation, a fix, a retest, sometimes a rollback, sometimes a customer conversation, and often a postmortem. The NIST report on inadequate software testing is old now but still useful for scale: it put the cost of inadequate testing infrastructure in the tens of billions a year to the US economy, with a large avoidable chunk available through earlier detection [2]. The number has aged but the lesson has not. Late detection is expensive, and a silent failure is late detection by design.

Controls fail too, and they fail quietly

The awkward part is that silent failures love to live inside the systems you built to prevent failure, and that should make any security or engineering person twitchy. A control is not magic. It is code, config, process, expectation, and habit, and every one of those can drift, break, or keep looking normal long after it has stopped doing its job. This is the single most useful thing my audit background gave me: the habit of never confusing the existence of a control with the effectiveness of one. A policy can exist and not be followed, a review can happen and catch nothing, and a dashboard can be green and not mean what you think it means.

Software people are not immune to this; we just use different words to describe it. "The test exists." "The CI job ran." "The alert is configured." "The issue is marked fixed." None of those statements proves a control worked. They prove an artefact exists, which is not the same thing. The questions that actually matter are sharper: did the test fail when the behaviour was wrong, did the CI job block the bad change, did the eligibility check refuse to answer when it could not answer, did the fix actually remove the underlying problem? If the answer is no, or if the answer is "I assume so," the green light is decoration.

The control objective

To be clear, eliminating silent failure is the wrong target, because you cannot. The principle I keep coming back to is this: if a workflow depends on me remembering to check it, it is already broken. My attention is not a control. It is variable, interruptible, and exceptionally good at sliding past a familiar green box without registering it. Telling yourself to just "watch more carefully" is not a system; it is a promise to be less human next time, and I am not less human next time. The correct control objective is to make failure loud by default, so that the absence of my attention is safe.

In practice that is a handful of concrete habits, each one earned from a lesson above.

A check that finds a problem must fail, not warn. A scanner that finds a serious issue and exits zero is not a control, it is a diary entry.
An operation that cannot complete must raise that failure, not return a tidy default. When the answer is unknown, say "unknown" loudly; do not quietly substitute a confident "no" and carry on.
Failure handlers must preserve evidence, not destroy it. An error path that wipes the very counters that would tell you how bad the failure was is worse than no error path at all.
Absence needs watching as much as errors do, because a job that silently stops running produces no error to catch, only a slowly growing nothing.
Green must be expensive: a passing status should have to earn itself by proving the thing you care about is still true.
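Of the habits above, watching for absence is the one that least resembles ordinary error handling, so here is a small sketch of it. It assumes each job records a timezone-aware timestamp of its last success somewhere a separate watcher can read; `is_stale` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success, max_age, now=None):
    """True when a job has not succeeded recently enough.

    `last_success` is a timezone-aware datetime, or None if the job has never
    reported success. Never having reported counts as stale, not as fine.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_success is None:
        return True
    return now - last_success > max_age
```

The watcher has to run somewhere other than the job it is watching, otherwise the two stop together and the silence is complete. For a nightly job, a `max_age` a little over a day (say `timedelta(hours=26)`) leaves room for a slow run without hiding a missed one.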

At the end of the day, if green is just the default colour when nothing complains loudly enough, it is lying to you.

But why do we end up with fail-but-still-green so often? Quite simply because people skip testing the controls themselves: not "does the pipeline run?" but "does the pipeline fail when I feed it the exact fault it is supposed to catch?"; not "do we have backups?" but "have we restored one?"; not "is there an alert?" but "does it fire when the symptom is real?" This is very boring work. It is also the work that stops a bad day becoming a bad week becoming a bad month.
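Testing a control can be as plain as feeding it one input it must reject and one it must accept. A minimal sketch, assuming the control can be called as a function that returns True for "pass"; `verify_control` and the toy `no_plaintext_password` check are invented for illustration.

```python
def verify_control(check, known_bad, known_good):
    """Return a list of problems with the control itself; empty means it
    still tells good from bad on these two samples."""
    problems = []
    if check(known_bad):
        problems.append("control passed a known-bad input")
    if not check(known_good):
        problems.append("control rejected a known-good input")
    return problems


def no_plaintext_password(config_text):
    """Toy check (hypothetical): passes only if no 'password=' appears."""
    return "password=" not in config_text.lower()
```

Running `verify_control(no_plaintext_password, "PASSWORD=hunter2", "timeout=30")` returns an empty list, while a check that has decayed into `lambda _: True` is caught by the known-bad sample. Two samples prove very little about coverage, but they do prove the check can still say no.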

The louder twin

Here the other side of the coin comes back, because the most common way people create silent failures is by overcorrecting into the opposite problem. Frightened of missing something, you give everything a dashboard, every event a notification, every check a page, and after a while the system emits so much low-grade noise that the only rational human response is to ignore nearly all of it. That is not a personal failing, it is just the predictable result of bad signal design. And the cruel twist is that overload manufactures silent failures of its own, because a real alarm buried in a hundred routine ones is, for every practical purpose, an alarm nobody heard. The Chernobyl control room was famously a wall of alarms, and when everything was screaming at once nobody could tell which scream was the one that mattered. And we all know how that ended. Whilst a literal Chernobyl is probably not on the table for your CI pipeline, the metaphorical kind (the critical signal that was technically present and functionally invisible) can still wreck your week personally. Drown the signal and you have built a silent failure with extra steps.

The fix

The answer to both extremes is the same, and it is neither "more alerts" nor "fewer alerts"; it is better ones. The Google Site Reliability Engineering guidance has the right pattern: alert on symptoms that matter, not on every twitchy internal measurement. Their four golden signals for user-facing systems are latency, traffic, errors, and saturation, but the specific list matters less than the rule behind it, which is to measure the things that tell you whether the system is actually doing its job [3]. Get that right and you slip between both failure modes at once: few enough signals that each one still earns attention, and meaningful enough that silence genuinely means things are fine. The same test applies to every control I named above. Do not ask only whether the check ran, ask whether it can still make a real decision. Do not ask only whether the tracker has items, ask whether "resolved" still means resolved. Do not ask only whether the dashboard is green, ask what would actually have to break for it to turn red. If that last question is hard to answer, the dashboard is probably just pretty, shiny lights, empty of meaning.
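To make "alert on symptoms" concrete, here is a rough sketch of a paging rule driven by three of those signals over a single time window. Everything specific in it is an assumption: the `Window` shape, the function name, and especially the thresholds, which are placeholders and not recommendations from the SRE guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Measurements over one time window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float


def page_reasons(window, max_error_ratio=0.01, max_p99_ms=800.0):
    """Return the user-visible symptoms worth paging for; empty means stay quiet."""
    if window.requests <= 0:
        # Absence of traffic is a symptom too, and the ratios below
        # would be meaningless without any requests.
        return ["no traffic in window"]
    reasons = []
    error_ratio = window.errors / window.requests
    if error_ratio > max_error_ratio:
        reasons.append(f"error ratio {error_ratio:.2%} exceeds {max_error_ratio:.2%}")
    if window.p99_latency_ms > max_p99_ms:
        reasons.append(f"p99 latency {window.p99_latency_ms:.0f}ms exceeds {max_p99_ms:.0f}ms")
    return reasons
```

The design choice is in what is left out. Internal measurements can still be logged and graphed, but only a short list of things a user would notice is allowed to interrupt a human, and it is easy to say what would have to break for the list to be non-empty.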

The standard I want

I do not want perfect systems. Perfect systems are mostly fictional, and chasing them generates its own special nonsense. I want systems that fail honestly. If a build is broken, fail the build. If a check has stopped checking, say so. If a scheduled job did not run, treat the absence as the failure it is. And if a control cannot demonstrate it is still doing its job, distrust it until it can. That sounds exhausting, but it is considerably less work than the alternative, and far kinder. Loud failures create a small attention bill now; silent failures create a bill so large that the cheques inevitably bounce. I would rather pay the small bill now, in a system that cries wolf occasionally, than the large one later, in a system that stayed silent while the sheep wandered off.

Sources

[1] Morendil, "The IBM Systems Science Institute". Investigation into the widely repeated defect-cost multiplier attributed to an IBM Systems Sciences Institute study, concluding that the primary source is not recoverable.
[2] NIST Planning Report 02-3, "The Economic Impacts of Inadequate Infrastructure for Software Testing". 2002 RTI report for NIST estimating the national cost of inadequate software testing infrastructure and the potential reduction from better testing and earlier detection.
[3] Google SRE Book, Chapter 6, "Monitoring Distributed Systems". Monitoring guidance including the four golden signals and the distinction between useful alerts and noisy measurements.