

The operational decay audit

6 min read · By Jayben Bertrand

Five questions that tell you whether your operational system is experiencing decay — and what the answers mean for your team.


Operational decay does not announce itself. There is no alarm that fires when your operating layer starts to erode. The system keeps running. The team keeps responding. And somewhere in the background, the accumulated cost of unrecorded decisions, outdated runbooks, and drifting configurations quietly compounds.

The audit below is not a scorecard. It is a diagnostic. Each question is designed to surface a specific failure pattern — the kind that feels manageable in isolation but becomes structural over time. Work through them honestly, and you will know more about the health of your operation than most post-incident reviews will tell you.


Question 1: Can a person who wasn't involved in the build diagnose a failure in under an hour?

This is the legibility test. A system designed to be operated makes its own state visible — not to the people who built it, but to whoever is on call at 2am on a Tuesday six months after the build team moved on.

If the answer is no, the gap is usually one of two things. Either the telemetry is insufficient — the system does not expose enough about its own state to support diagnosis — or the context is missing. The data exists, but the meaning does not. Thresholds were set for reasons no one has written down. Error codes map to conditions that only make sense if you were in the room when the edge case was first discovered.

What this means for your team: Start with the failure modes that have already occurred. For each one, write a one-page document: what the symptom looked like, what caused it, and what resolved it. That record is the foundation of diagnosability. It does not require new tooling. It requires thirty minutes and institutional honesty.
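One lightweight way to keep those one-pagers consistent is to give them a fixed shape. Here is a minimal sketch in Python; the field names, the to_markdown helper, and the example entry are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """One page per failure mode: what it looked like, why it happened, how it was resolved."""
    component: str
    symptom: str          # what the on-call person actually saw
    cause: str            # the underlying condition, in plain language
    resolution: str       # the action that restored service
    first_observed: str   # date of the first occurrence

    def to_markdown(self) -> str:
        return (
            f"# {self.component}: {self.symptom}\n\n"
            f"- First observed: {self.first_observed}\n"
            f"- Cause: {self.cause}\n"
            f"- Resolution: {self.resolution}\n"
        )

# Example entry — every detail here is invented for illustration.
record = FailureRecord(
    component="gateway-sync",
    symptom="devices report stale timestamps after reconnect",
    cause="clock drift exceeds the sync window when a device is offline for more than 48h",
    resolution="force a full resync instead of a delta sync on reconnect",
    first_observed="2024-03-11",
)
print(record.to_markdown())
```

The format matters far less than the fact that every failure mode gets the same three questions answered and recorded.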


Question 2: When was your runbook last validated against actual system behaviour?

A runbook written at launch reflects the system as it was understood at launch. Every configuration change, firmware update, and architectural adjustment since then is a potential gap between what the runbook says and what the system actually does.

Most teams know their runbooks are out of date. Fewer know by how much. The honest answer to this question is often not a date — it is "we don't know," which is itself the answer.

What this means for your team: Runbook validity decays fastest in the sections that describe normal operation, because those are the sections teams stop reading once the system is stable. Failure response sections tend to get updated after incidents. Healthy-state definitions rarely do. Walk through your runbook's description of normal system behaviour and check it against current telemetry. The gaps you find are the gaps your team will be navigating blind during the next incident.
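If the runbook's healthy-state numbers live anywhere machine-readable, part of that walk-through can be automated. A sketch of the idea, assuming a hypothetical fetch_current_metrics() helper wired to your telemetry system; the components and threshold values are placeholders:

```python
# Documented "normal operation" values, copied from the runbook as written.
RUNBOOK_NORMALS = {
    "ingest-queue":  {"error_rate": 0.01, "p95_latency_ms": 200},
    "device-uplink": {"error_rate": 0.02, "p95_latency_ms": 500},
}

def fetch_current_metrics(component: str) -> dict:
    """Hypothetical: return the currently observed values for a component."""
    raise NotImplementedError

def find_runbook_drift(tolerance: float = 0.25) -> list[str]:
    """Flag metrics where observed behaviour has drifted more than `tolerance`
    (as a fraction) from what the runbook still claims is normal."""
    drifted = []
    for component, documented in RUNBOOK_NORMALS.items():
        observed = fetch_current_metrics(component)
        for metric, documented_value in documented.items():
            observed_value = observed.get(metric)
            if observed_value is None:
                drifted.append(f"{component}.{metric}: no longer reported at all")
            elif abs(observed_value - documented_value) > tolerance * documented_value:
                drifted.append(
                    f"{component}.{metric}: runbook says {documented_value}, "
                    f"telemetry says {observed_value}"
                )
    return drifted
```

Every line that check produces is a runbook section that no longer describes the system you are actually running.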


Question 3: Do you have a rollout mechanism that lets you stop at 10%?

This question is about blast radius. When a configuration change or firmware update goes wrong — and eventually one will — how much of your fleet is exposed before you can stop it?

If the answer is "all of it," you are not operating with a rollout model. You are operating with a binary: either the change works, or everything is affected simultaneously. That is not a risk management posture. It is an absence of one.

What this means for your team: A staged rollout does not require sophisticated infrastructure. It requires discipline about sequencing and a defined threshold for what constitutes a failed rollout. Start with cohorts: a small group of devices that receive changes first, a wait period, a specific set of signals that determine whether to proceed. The mechanism can be simple. What matters is that it exists and that the team uses it.
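The whole mechanism can fit on a page. A minimal sketch, assuming hypothetical deploy_to(), observed_error_rate(), and roll_back() helpers wired to your own fleet tooling, with placeholder cohort sizes and thresholds:

```python
import time

# Cohorts in rollout order: a small canary group first, then progressively larger slices.
COHORTS = ["canary-1pct", "early-10pct", "remaining-89pct"]

SOAK_PERIOD_S = 6 * 3600      # the wait period before judging a cohort
ABORT_ERROR_RATE = 0.05       # the defined threshold for a failed rollout

def deploy_to(cohort: str, version: str) -> None:
    """Hypothetical: push the change to one cohort."""
    raise NotImplementedError

def observed_error_rate(cohort: str) -> float:
    """Hypothetical: error rate for the cohort since its deploy."""
    raise NotImplementedError

def roll_back(version: str) -> None:
    """Hypothetical: revert every cohort that has received the change."""
    raise NotImplementedError

def staged_rollout(version: str) -> bool:
    for cohort in COHORTS:
        deploy_to(cohort, version)
        time.sleep(SOAK_PERIOD_S)                  # the wait is part of the mechanism
        if observed_error_rate(cohort) > ABORT_ERROR_RATE:
            roll_back(version)                     # stop here; only this slice was exposed
            return False
    return True
```

Nothing about this requires a deployment platform. It requires cohorts, a wait, and a threshold the team has agreed on in advance.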


Question 4: Who owns the decision to roll back — and do they know it?

Incident response fails in two distinct ways. The first is technical: the team lacks the information or tooling to act. The second is organisational: the team has what it needs but does not have clear authority to use it.

The second failure is more common than most post-incident reviews acknowledge. In a degraded situation, the instinct is to escalate. Escalation takes time. Time, in a live incident, is the variable you most want to conserve. If the person on call needs to wait for approval before initiating a rollback, the cost of that wait is real — and it is paid by the system and the people depending on it.

What this means for your team: Define the rollback decision before the incident, not during it. Who can initiate a rollback? Under what conditions? What information do they need? The answers do not need to be elaborate. A single paragraph per critical component, kept somewhere the on-call team can find it, is enough to change the shape of an incident.
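That single paragraph per component can be as plain as a structured entry the on-call team can read in ten seconds. A sketch; the components, roles, and conditions below are examples, not a recommendation:

```python
# One entry per critical component: who may initiate a rollback, when, and with what in hand.
ROLLBACK_AUTHORITY = {
    "device-firmware": {
        "who": "any on-call engineer, no approval required",
        "when": "error rate above 5% on the newest cohort, or loss of telemetry from it",
        "needs": "the previously deployed firmware version and the cohort identifier",
    },
    "fleet-config": {
        "who": "on-call engineer, with a note to the ops channel afterwards",
        "when": "any config change correlated with a drop in device check-ins",
        "needs": "the config revision being reverted to",
    },
}
```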


Question 5: Is your definition of "healthy" written down anywhere?

This is the quietest question, and often the most revealing. Most teams have an intuition about what healthy looks like. Fewer have written it down in a form that a new team member — or a tired team member at the end of a long shift — could use to make a confident assessment.

Without a written definition, "healthy" drifts. Thresholds that were set conservatively get normalised as the system ages. A device that would have triggered concern in the first month gets ignored in the twelfth because it has always looked like that. The definition of acceptable degrades to match observed reality, which is the operational equivalent of losing your baseline.

What this means for your team: For each critical component, write down what healthy looks like — not in terms of what the system is doing right now, but in terms of what it was designed to do. Error rate below X. Latency under Y. Last successful communication within Z minutes. Those numbers, kept current and visible, are what allow your team to distinguish a system that is running from a system that is running well.
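Written down, that definition is small enough to live next to the code that checks it. A sketch with placeholder numbers; the component names, thresholds, and the shape of the metrics dict are assumptions for illustration:

```python
# The designed-for definition of healthy, per critical component.
HEALTHY = {
    "device-uplink": {
        "max_error_rate": 0.02,            # error rate below X
        "max_p95_latency_ms": 500,         # latency under Y
        "max_minutes_since_contact": 15,   # last successful communication within Z minutes
    },
}

def is_healthy(component: str, metrics: dict) -> bool:
    """Compare observed metrics against the written definition, not against recent habit."""
    limits = HEALTHY[component]
    return (
        metrics["error_rate"] <= limits["max_error_rate"]
        and metrics["p95_latency_ms"] <= limits["max_p95_latency_ms"]
        and metrics["minutes_since_contact"] <= limits["max_minutes_since_contact"]
    )
```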


What the audit tells you

If you worked through these five questions and most of your answers were confident and specific, your operating layer is in reasonable health. The gaps you found are the places to invest next.

If most of your answers were vague, qualified, or "we'd have to check" — that is not a crisis, but it is a direction. Operational decay is not reversed by a single intervention. It is reversed by the same mechanism that caused it: incremental decisions, made consistently, over time. The audit tells you where to start. The work is in the follow-through.


Field Operations is a series on the operational realities of running distributed systems in the field.