The Black Box Mindset – Khurram Saleem

The page comes at 2 a.m. I have been the engineer who answers it, and I have been the engineer whose code change caused it. Both sides taught me something. Neither is comfortable to admit.

Over the years, I have watched memory leaks silently consume a service over hours. I have seen a single wrong cloud configuration take down a payment flow at peak traffic. I have witnessed a code change so innocuous it passed every review – and then detonated in production under a load no one had modelled. Every one of those incidents ended the same way: smart people staring at dashboards, and someone on the other end of the system paying for a blind spot we did not know we had.

What I have come to believe is that the discipline of root cause analysis – done well – is one of the most undervalued engineering practices we have. And the best model I know for it does not come from software at all.

When a Plane Goes Down

Air crash investigation is, arguably, the most rigorous root cause analysis process ever designed. When something goes wrong in aviation, investigators do not accept surface explanations. They do not stop at the first plausible cause. They reconstruct the full sequence – mechanical, environmental, human – until they understand not just what failed, but why the system allowed it to fail.

The stakes in aviation are irreversible. A crash is not a degraded SLA. And yet, precisely because the stakes are so high, the aviation industry built something that most software teams have not: a culture of structured, blameless, evidence-driven investigation that leads to permanent systemic change.

Tracking down a production bug can feel as complex as investigating an air crash – and the discipline we bring to it should match.

The parallel is not perfect. Software outages do not claim lives. But the analytical rigour, the collaboration model, and the commitment to prevention over blame – those translate directly.

the analogy

Aviation: Aircraft Systems

Engines, avionics, hydraulics, flight controls – each examined independently, then as an integrated system.

Software: Distributed Systems

APIs, databases, middleware, message queues, CDN edges – each a potential failure domain, each a necessary line of investigation.

Aviation: The Black Box

Flight data recorder and cockpit voice recorder. Immutable, tamper-resistant, always-on. The ground truth of any investigation.

Software: Observability Stack

Structured logs, distributed traces, metrics. Only as useful as the discipline used to instrument them – and the culture that treats them as ground truth.

the four pillars

What Aviation Teaches Software Engineers

Having lived through enough incidents to know what separates a well-handled crisis from a prolonged one, I keep returning to four things. Not as a checklist – as a way of thinking.

// 01 – Integration Layers

Every Layer Is a Suspect

An aircraft is not its engines. It is the interaction between engines, avionics, controls, and the humans operating them. Investigators examine each system, then the boundaries between them. In software, the fault is almost never "the database." It is the connection pool under load, the retry logic that amplified the failure, the circuit breaker that was never tuned. The integration layer is where complexity hides – and where root causes live. Trace the full stack before forming a theory.
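To make the retry and circuit breaker point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a production library: the names `CircuitBreaker`, `CircuitOpenError` and `call_with_retry`, and the thresholds, are hypothetical. It shows retries that back off with jitter and that refuse to keep hammering a dependency once the breaker has opened.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retry(breaker, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry with exponential backoff and jitter, but never through an open breaker."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # retrying against an open breaker is how retries amplify an outage
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The values of `max_failures` and `reset_after` are exactly the kind of setting that deserves tuning under realistic load, since a breaker left at arbitrary defaults is the "never tuned" case described above.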

// 02 – Monitoring & Alerting

Your Logs Are the Black Box – Treat Them as One

Aviation's black box works because it is always on, captures everything, and cannot be switched off when things get uncomfortable. Most software monitoring is the opposite: instrumented for the happy path, silent about the edge cases, alerting only after the damage has been done. Real observability means knowing the health of your system before your users do. CPU spikes, memory climbing over six hours, error rate creeping above baseline – these are the pre-crash warnings. The black box mindset says: if it's not recorded, it didn't happen.
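As a rough illustration of treating logs as a recorder, the sketch below uses Python's standard `logging` module to emit one JSON object per event. The `JsonFormatter` and `build_logger` names, and the convention of passing context through `extra={"ctx": {...}}`, are assumptions of this example rather than an established standard.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)) + "Z",
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Context passed as extra={"ctx": {...}} is a convention of this sketch.
        entry.update(getattr(record, "ctx", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, sort_keys=True)


def build_logger(name, stream):
    """Return a logger that writes structured JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `log.warning("pool_exhausted", extra={"ctx": {"trace_id": "abc123", "pool_size": 20}})` then produces a single machine-readable line. Recording the unhappy paths with the same fields as the happy path is what lets an investigator reconstruct the sequence afterwards.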

// 03 – Collaboration

No Single Expert Has the Full Picture

Air crash investigations convene pilots, mechanical engineers, meteorologists, human factors specialists, and air traffic controllers. Not because any one person's account is wrong – but because no single perspective is complete. The same is true of a production incident. The developer who wrote the change, the operations engineer watching the infra metrics, the QA engineer who ran the test suite, the platform team that owns the orchestration layer – each holds a piece. The root cause almost always lives at the intersection. Silo-based incident response is the engineering equivalent of interviewing only the pilot.

// 04 – Speed With Structure

Fast Is Not the Same as Rushed

Aviation investigations can take months. We do not have that luxury – our SLAs are measured in hours, sometimes minutes. But speed and rigour are not opposites. The goal is to detect fast, stabilise fast, and then – only then – investigate thoroughly. Rushed root cause analysis produces wrong root causes, and wrong root causes produce repeated incidents. The aviation principle that applies here is not the duration of the investigation. It is the early warning system: detect the anomaly before it becomes a crash, and you buy yourself time to respond with structure rather than panic.
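One simple way to picture an early warning, sketched in Python: fit a straight line to recent memory samples and estimate how long remains before a limit is reached. The function names, the 12-hour horizon and the assumption of roughly linear growth are all illustrative. A real system would more likely rely on its monitoring platform's own trend or forecast features.

```python
def slope_per_hour(samples):
    """Least-squares slope for a list of (hours, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def hours_until_limit(samples, limit):
    """Projected hours from the last sample until `limit`, or None if not rising."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    _, last_value = samples[-1]
    return max(0.0, (limit - last_value) / slope)


def should_warn(samples, limit, horizon_hours=12.0):
    """Warn while there is still time to respond with structure."""
    eta = hours_until_limit(samples, limit)
    return eta is not None and eta <= horizon_hours


# Hypothetical data: memory in MB sampled hourly, climbing 100 MB per hour.
memory = [(hour, 1000 + 100 * hour) for hour in range(7)]
warn = should_warn(memory, limit=2048)
```

With these made-up numbers the projection is a little under five hours to the 2048 MB limit, so `warn` is `True` long before anything has actually failed.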

the framework

Two Phases. Five Questions. One Discipline.

After years of being in and around production incidents, I have come to organise the whole problem around five questions – divided into two distinct phases that must not be collapsed into each other. Collapsing them is how teams fix the wrong thing and see the incident recur.

Phase 01 – Detect & Contain

What happened? When? Where?

What changed in the system or its environment?
When did the first anomaly appear in telemetry?
Where in the stack did it originate – and where did it propagate?
What is the blast radius right now?
What is the fastest path to stability for users?

Phase 02 – Understand & Prevent

Why did it happen? How do we prevent it?

Why did the system allow this failure mode to exist?
Why did our monitoring not catch it earlier?
Why did the safeguards not engage as expected?
What systemic change prevents recurrence?
What does the postmortem teach the whole team?

Phase one is about speed and blast radius. Phase two is about depth and permanence. The teams I have seen handle incidents best are the ones who are disciplined about keeping these phases separate – who resist the pressure to explain before they have stabilised, and who resist the temptation to close the ticket once stability is restored without doing the harder work of understanding why.
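The separation of phases can also be encoded in tooling. The Python sketch below is a hypothetical incident record, not a description of any particular tracker: it refuses to accept a "why" while the incident is still being contained, and refuses to close until all five questions have answers. The `Incident` and `Phase` names and the question keys are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum

PHASE_ONE = ("what", "when", "where")
PHASE_TWO = ("why", "how_to_prevent")


class Phase(Enum):
    DETECT_AND_CONTAIN = 1
    UNDERSTAND_AND_PREVENT = 2
    CLOSED = 3


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT_AND_CONTAIN
    answers: dict = field(default_factory=dict)

    def answer(self, question, text):
        if self.phase is Phase.CLOSED:
            raise RuntimeError("incident is closed")
        # During containment only what/when/where may be recorded.
        if self.phase is Phase.DETECT_AND_CONTAIN:
            allowed = PHASE_ONE
        else:
            allowed = PHASE_ONE + PHASE_TWO
        if question not in allowed:
            raise ValueError(f"{question!r} cannot be answered during {self.phase.name}")
        self.answers[question] = text

    def mark_stable(self):
        """Users are safe again; the investigation proper can begin."""
        if self.phase is not Phase.DETECT_AND_CONTAIN:
            raise RuntimeError("incident is already past containment")
        self.phase = Phase.UNDERSTAND_AND_PREVENT

    def close(self):
        if self.phase is not Phase.UNDERSTAND_AND_PREVENT:
            raise RuntimeError("stabilise before closing")
        missing = [q for q in PHASE_ONE + PHASE_TWO if q not in self.answers]
        if missing:
            raise RuntimeError(f"unanswered questions: {missing}")
        self.phase = Phase.CLOSED
```

Whether a hard gate like this suits a given team is a judgment call. The point of the sketch is only that "stable" and "closed" are different states, and the record should not let one stand in for the other.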

The fix that keeps the service up tonight is not the same as the fix that prevents this from happening again. Both matter. Neither replaces the other.

I have made expensive mistakes in production. Some of them I caught. Some of them caught me. What I learned from every one of them – and from watching how aviation treats failure – is that the quality of your investigation determines the quality of your next system. Build the black box. Run the investigation. Write the postmortem no one is afraid to read.