I have been on both sides of a production incident. I have been the engineer paged at 2 a.m., and I have been the engineer whose code change caused the page. Both experiences taught me something important — and neither one is comfortable to admit.
Over the years, I have watched memory leaks silently consume a service over hours. I have seen a single wrong cloud configuration take down a payment flow at peak traffic. I have witnessed a code change so innocuous it passed every review — and then detonated in production under a load no one had modelled. The result is always the same: a nightmare for everyone involved.
What I have come to believe is that the discipline of root cause analysis — done well — is one of the most undervalued engineering practices we have. And the best model I know for it does not come from software at all.
When a Plane Goes Down
Air crash investigation is, arguably, the most rigorous root cause analysis process ever designed. When something goes wrong in aviation, investigators do not accept surface explanations. They do not stop at the first plausible cause. They reconstruct the full sequence — mechanical, environmental, human — until they understand not just what failed, but why the system allowed it to fail.
The stakes in aviation are irreversible. A crash is not a degraded SLA. And yet, precisely because the stakes are so high, the aviation industry built something that most software teams have not: a culture of structured, blameless, evidence-driven investigation that leads to permanent systemic change.
Tracking down a production bug can feel as complex as investigating an air crash — and the discipline we bring to it should match.
The parallel is not perfect. Software outages do not claim lives. But the analytical rigour, the collaboration model, and the commitment to prevention over blame — those translate directly.
Aviation: Engines, avionics, hydraulics, flight controls. Each is examined independently, then as an integrated system.
Software: APIs, databases, middleware, message queues, CDN edges. Each is a potential failure domain, and each is a necessary line of investigation.
Aviation: Flight data recorder and cockpit voice recorder. Immutable, tamper-resistant, always-on. The ground truth of any investigation.
Software: Structured logs, distributed traces, metrics. Only as useful as the discipline used to instrument them, and the culture that treats them as ground truth.
What Aviation Teaches Software Engineers
Having lived through enough incidents to know what separates a well-handled crisis from a prolonged one, I keep returning to four things. Not as a checklist — as a way of thinking.
Every Layer Is a Suspect
An aircraft is not its engines. It is the interaction between engines, avionics, controls, and the humans operating them. Investigators examine each system, then the boundaries between them. In software, the fault is almost never "the database." It is the connection pool under load, the retry logic that amplified the failure, the circuit breaker that was never tuned. The integration layer is where complexity hides — and where root causes live. Trace the full stack before forming a theory.
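To make the "circuit breaker that was never tuned" concrete, here is a minimal sketch of what such a breaker does. The class name, the threshold, and the cool-down value are all illustrative assumptions, not taken from any particular library; a real service would more likely use a maintained resilience library and tune these numbers against observed traffic.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without touching the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value, tune per dependency
        reset_after_s: float = 30.0,     # assumed cool-down, tune per dependency
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure from here re-opens the breaker immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the consecutive-failure count.
        self._failures = 0
        return result
```

The point for an investigation is that both numbers are decisions someone made, or failed to make. A threshold set too high lets retries amplify the failure; a cool-down set too short sends traffic back to a dependency that has not recovered.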
Your Logs Are the Black Box — Treat Them as One
Aviation's black box works because it is always on, captures everything, and cannot be switched off when things get uncomfortable. Most software monitoring is the opposite: instrumented for the happy path, silent about the edge cases, alerting only after the damage has been done. Real observability means knowing the health of your system before your users do. CPU spikes, memory climbing over six hours, error rate creeping above baseline — these are the pre-crash warnings. The black box mindset says: if it's not recorded, it didn't happen.
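As a small illustration of the black box mindset, the sketch below emits every log line as one JSON object, using only Python's standard `logging` and `json` modules. The field names, the logger name `payments`, and the `context` attribute are my own assumptions for the example; adapt them to whatever schema your log pipeline expects.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line so it can be queried later."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute name, passed via logging's `extra`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "payments") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (illustrative setup)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().warning("retrying upstream call", extra={"context": {"trace_id": "abc123", "attempt": 2}})` then produces a record that carries its own trace identifier, which is what lets an investigator line up events across services after the fact.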
No Single Expert Has the Full Picture
Air crash investigations convene pilots, mechanical engineers, meteorologists, human factors specialists, and air traffic controllers. Not because any one person's account is wrong, but because no single perspective is complete. The same is true of a production incident. The developer who wrote the change, the operations engineer watching the infra metrics, the QA engineer who ran the test suite, the platform team that owns the orchestration layer: each holds a piece. The root cause almost always lives at the intersection. Silo-based incident response is the engineering equivalent of interviewing only the pilot.
Fast Is Not the Same as Rushed
Aviation investigations can take months. We do not have that luxury: our SLAs are measured in hours, sometimes minutes. But speed and rigour are not opposites. The goal is to detect fast, stabilise fast, and then, only then, investigate thoroughly. Rushed root cause analysis produces wrong root causes, and wrong root causes produce repeated incidents. The aviation principle that applies here is not the duration of the investigation. It is the early warning system: detect the anomaly before it becomes a crash, and you buy yourself time to respond with structure rather than panic.
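One way to picture an early warning is a projection rather than a threshold. The sketch below fits a straight line to recent memory samples and estimates how many hours remain before a limit is reached. The function name, the units, and the assumption that growth is roughly linear are mine; real leaks are often lumpier, so treat the output as a prompt to look, not a prediction.

```python
from typing import Optional, Sequence, Tuple


def hours_until_limit(
    samples: Sequence[Tuple[float, float]],
    limit_mb: float,
) -> Optional[float]:
    """Project hours from the last sample until memory reaches limit_mb.

    samples: (hours_since_start, memory_mb) pairs, oldest first.
    Returns None when there is too little data or memory is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope: MB gained per hour.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t
    t_hit = (limit_mb - intercept) / slope
    # Clamp at zero when the fitted line has already crossed the limit.
    return max(0.0, t_hit - samples[-1][0])
```

Alerting when the projection drops below, say, a working day gives the on-call engineer a decision to make in daylight instead of a page at 2 a.m. The exact lead time is a judgment call for each service.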
Two Phases. Five Questions. One Discipline.
After years of being in and around production incidents, I have come to organise the whole problem around five questions (what, when, where, why, and how), divided into two distinct phases that must not be collapsed into each other. Collapsing them is how teams fix the wrong thing and see the incident recur.
Phase one: What happened? When? Where?
- What changed in the system or its environment?
- When did the first anomaly appear in telemetry?
- Where in the stack did it originate — and where did it propagate?
- What is the blast radius right now?
- What is the fastest path to stability for users?
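The first two questions in this list, what changed and when the first anomaly appeared, can be joined mechanically. The sketch below is one hedged way to do it: find the first telemetry sample above a threshold, then list the changes that landed in a lookback window before it, most recent first. The `ChangeEvent` type, the threshold, and the one-hour window are illustrative assumptions; in practice the change feed would come from your deploy and configuration tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of a deploy, config edit, or flag flip."""
    at: float            # epoch seconds
    description: str


def first_anomaly(
    series: Sequence[Tuple[float, float]],
    threshold: float,
) -> Optional[float]:
    """Return the timestamp of the first sample above threshold, if any.

    series: (epoch_seconds, value) pairs, oldest first.
    """
    for ts, value in series:
        if value > threshold:
            return ts
    return None


def suspect_changes(
    changes: Sequence[ChangeEvent],
    anomaly_at: float,
    lookback_s: float = 3600.0,   # assumed window; widen for slow-burn failures
) -> List[ChangeEvent]:
    """Changes within lookback_s before the anomaly, most recent first."""
    window = [c for c in changes if anomaly_at - lookback_s <= c.at <= anomaly_at]
    return sorted(window, key=lambda c: c.at, reverse=True)
```

The result is a list of suspects, not a verdict. A change that landed just before the anomaly is a place to start looking, and a slow failure such as a memory leak may trace back to a change well outside a short window.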
Phase two: Why did it happen? How do we prevent it?
- Why did the system allow this failure mode to exist?
- Why did our monitoring not catch it earlier?
- Why did the safeguards not engage as expected?
- What systemic change prevents recurrence?
- What does the postmortem teach the whole team?
Phase one is about speed and blast radius. Phase two is about depth and permanence. The teams I have seen handle incidents best are the ones who are disciplined about keeping these phases separate — who resist the pressure to explain before they have stabilised, and who resist the temptation to close the ticket once stability is restored without doing the harder work of understanding why.
The fix that keeps the service up tonight is not the same as the fix that prevents this from happening again. Both matter. Neither replaces the other.