Five Stepping Stones to Systems That Last

Every system I have ever worked on eventually reached a moment of reckoning – the point where the decisions made at the beginning either held weight or began to collapse. After years of watching both outcomes unfold, I wrote down the five principles I return to every time I am building something that needs to last.

These are not architectural patterns. They are not framework choices or cloud configurations. They are the layer of thinking that sits above all of those – the lens through which good decisions about patterns and frameworks and configurations become possible. You can pick any technology stack and still get these right or wrong. The stack is only as good as the thinking behind it.

There is no problem that cannot be solved. There are only problems we have not yet decided to take seriously.

the five stones

// Stone 01

Right Mindset, Right Skillset

At a Japanese e-commerce giant I was handed a problem I had no obvious right to solve: lead the team rebuilding the search engine – while knowing almost nothing about search engines myself. I was expected to carry the domain, the tech stack, and the team all at once, and at some point my mind simply stopped. I was out of ideas. So I put it to the team plainly: if no existing tool or library will get us there, we build the thing ourselves. We failed at first. Then we began to succeed in small steps – and somewhere in there I learned the lesson this entire stone rests on.

The foundation of any sustainable system is the team building it – specifically, the team's willingness to accept difficult problems as genuine challenges rather than reasons to stop. In my experience, every hard architectural decision, every performance constraint, every non-functional requirement that looked impossible had one thing in common: it yielded to the right combination of mindset and skill. Not always the same people, not always the same approach – but always someone who believed the problem was solvable.

Think of your team's capability not as a fixed set of skills but as a skill spectrum: a distributed range of proficiency where different people anchor different areas of depth. The goal is not to find one person who knows everything – that person does not exist, and if they did, they would be a single point of failure. The goal is to build a spectrum with no critical gaps, and then to raise the floor of that spectrum continuously. When a new challenge arrives that your spectrum does not yet cover, the right response is not to declare it out of scope. The right response is to find the anchor closest to it and build from there.

Sustainable systems are built by teams that grow toward their problems. Not away from them.

// Stone 02

Readable, Testable, and Cost-Effective Code

Here is the trap I have watched good teams fall into. Java will compile your code whether it is good or bad – the compiler cannot tell the difference. The feature ships, it serves its purpose, everyone moves on. Then six months later someone needs to change it, the person who wrote it has left or moved up, and the only thing standing between that next developer and a week of pain is whether the code can be read and tested. Readability is not a courtesy to yourself. It is a bill you either pay now or hand, with interest, to whoever inherits the code.

There is a triangle every engineering team must navigate: the quality of the code you write, the cost of writing and maintaining it, and the speed at which you can ship. Pull any one of these too hard and the other two suffer. I have seen teams obsess over code quality to the point where nothing ships for months. I have also seen teams ship at breakneck speed until the codebase became unmaintainable and the cost of every new feature tripled.

The balance point is not static – it shifts with the stage of your product, the size of your team, and the maturity of your platform. But the direction is always the same: build only what is necessary, and build it as clearly as you can. Readability is not a luxury. Code that is difficult to read is code that is difficult to test, and code that is difficult to test is code that is expensive to change. That cost compounds every single quarter.
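To make "difficult to read is difficult to test" concrete, here is a minimal sketch in Python. The function name `is_sale_active` and the dates are hypothetical, chosen only for illustration. The one design choice worth noticing is that the clock is passed in as an argument instead of being read from a hidden global, which is what lets a test pin the time.

```python
from datetime import datetime


def is_sale_active(now: datetime, start: datetime, end: datetime) -> bool:
    """Return True when `now` falls inside the half-open window [start, end)."""
    return start <= now < end


# Hypothetical sale window, used only to exercise the function.
SALE_START = datetime(2024, 11, 29, 0, 0)
SALE_END = datetime(2024, 12, 2, 0, 0)

if __name__ == "__main__":
    # The caller supplies the clock; nothing inside the function reads it.
    print(is_sale_active(datetime.now(), SALE_START, SALE_END))
```

A version that called `datetime.now()` internally would do the same job in production, but every test of it would depend on the day it ran.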

Cost-consciousness is increasingly non-optional. Platforms and services that cannot justify their infrastructure spend – compute, storage, egress – will not survive the coming decade. The engineer who thinks about the cost of their code while writing it is not being pessimistic. They are being professional.

// Stone 03

Trackable Errors, Recoverable Systems

In a distributed system of well-scoped, domain-driven microservices, an unhandled error does not announce itself – it vanishes. Without deliberate tracking, handling, and propagation, the failure is real but invisible: nothing in the logs, nothing in the monitoring. So I hold two targets in my head. Traceability: from the moment an error occurs, you should be able to find it – and know which component produced it – inside a minute. And resilience: when a component does fail, the system heals itself or degrades gracefully, finishing the end-to-end journey and showing the customer something honest instead of going dark.

Two questions define the operational maturity of any production system. When something goes wrong in your code: how quickly can you fix it and deploy the fix? When something takes down your service: how quickly can you identify the failure and restore service? These are not rhetorical questions. They should have numbers attached to them – SLAs, SLIs, SLOs. If they do not, the system has no contract with itself.
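One way to attach a number to the second question is an error budget: the downtime an availability SLO allows over a window. The sketch below is plain arithmetic in Python. The 99.9% target and the 30-day window in the usage note are assumed example values, not figures from any system described here.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Budget left after downtime already spent; a negative value means the SLO is breached."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

With an assumed target of 0.999 over 30 days, `error_budget_minutes(0.999, 30)` comes to roughly 43.2 minutes. A single 50-minute outage would already put `budget_remaining` below zero.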

For error traceability, the goal is a uniquely identifiable error that points back to its exact origin – the component, the layer, the line of code. If your error messages are generic, if your logs are inconsistent, if your correlation IDs do not survive hop boundaries, then your debugging process is archaeology. You are excavating context that should have been preserved.
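A minimal sketch of that idea in Python, under stated assumptions: the header name `X-Correlation-ID`, the `TracedError` class, and the `inventory-service` hop are all hypothetical names introduced for illustration. The error carries its own unique ID plus the component and layer that raised it, and the correlation ID is reused, never regenerated, as the request crosses a hop.

```python
import uuid
from typing import Dict

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name for this sketch


class TracedError(Exception):
    """An error that identifies itself and names where it came from."""

    def __init__(self, component: str, layer: str, message: str, correlation_id: str) -> None:
        super().__init__(message)
        self.error_id = uuid.uuid4().hex  # unique per occurrence
        self.component = component
        self.layer = layer
        self.message = message
        self.correlation_id = correlation_id

    def __str__(self) -> str:
        return (f"[{self.error_id}] {self.component}/{self.layer}: {self.message} "
                f"(correlation_id={self.correlation_id})")


def ensure_correlation_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that carries a correlation ID.

    An inbound ID is preserved so it survives the hop; a new one is minted only
    when the request arrives without one.
    """
    outbound = dict(headers)
    outbound.setdefault(CORRELATION_HEADER, uuid.uuid4().hex)
    return outbound


def reserve_stock(headers: Dict[str, str], sku: str, quantity: int) -> str:
    """Hypothetical downstream hop that fails with an error naming its origin."""
    correlation_id = headers[CORRELATION_HEADER]
    if quantity <= 0:
        raise TracedError("inventory-service", "validation",
                          f"invalid quantity {quantity} for {sku}", correlation_id)
    return f"reserved {quantity} x {sku}"
```

The exact line of code is not stored on the error here; in Python the traceback attached to the exception already records it, so logging the exception with its traceback covers that part.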

For recovery, the goal is components and services that are self-sufficient and self-deployable. And if you are thinking about self-healing and self-recovery from the earliest design discussions – congratulations. That is exactly the right time to be asking those questions. The cross-cutting concerns that make this possible – consistent logging, transaction management, eventual consistency handling, standardised error reporting – are boring to implement and absolutely critical to get right. Systems that treat these as afterthoughts are systems that become difficult to operate at scale.
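Graceful degradation can be sketched in a few lines. The Python below is an illustration under assumptions, not a production pattern: `with_fallback`, `personalised_recommendations`, and `bestsellers` are hypothetical names, and the retry is immediate, with no backoff or circuit breaking.

```python
import logging
from typing import Callable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("resilience")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T], attempts: int = 2) -> T:
    """Try `primary` up to `attempts` times, then degrade to `fallback`."""
    for attempt in range(1, attempts + 1):
        try:
            return primary()
        except Exception as exc:  # broad on purpose in a sketch; narrow it in real code
            logger.warning("primary failed (attempt %d/%d): %s", attempt, attempts, exc)
    return fallback()


def personalised_recommendations() -> List[str]:
    """Hypothetical dependency that is currently unreachable."""
    raise ConnectionError("recommendation service unreachable")


def bestsellers() -> List[str]:
    """Hypothetical degraded answer: honest, generic, and always available."""
    return ["bestseller-1", "bestseller-2"]
```

Calling `with_fallback(personalised_recommendations, bestsellers)` logs two warnings and returns the bestseller list, so the customer journey completes and the failure still leaves a trace in the logs.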

// Stone 04

Scale and Throughput by Design

Before I design anything, I run it past four benchmarks I treat as non-negotiable: throughput – the QPS and TPS it must sustain; load – how it holds as traffic grows; capacity – how it holds as the data grows; and scale – how it grows along both at once. Most of the real decisions – separating reads from writes, introducing a cache, choosing where state lives – fall straight out of answering those four honestly and early, rather than discovering them under load in production. I take this stone further in a companion piece on turning the four into measurable SLIs, SLOs and SLAs, with the commitments that keep a system honest as it grows.

Not every system needs to handle a million requests per second. But every system should know what it needs to handle – and the decisions about architecture, design, and technology stack should follow from that number, not precede it. When teams choose technologies before understanding their load profile, they are making bets without odds.

Scale thinking has two dimensions that are often conflated: volume (how much) and pattern (how it flows). A system that needs to process ten million events per day but can do so in batches overnight calls for a completely different architecture from one that must process those events within seconds of arrival. Identifying which of your data flows, task executions, and state transitions are truly synchronous – and which only feel synchronous because that was the path of least resistance – is one of the highest-leverage architectural questions you can ask.
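The ten-million-events example can be turned into rough numbers. The Python sketch below is back-of-envelope arithmetic only. The six-hour overnight window and the 5x peak-to-average ratio are assumptions picked for illustration; substitute your own measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(events: int, window_seconds: float) -> float:
    """Sustained events per second needed to clear `events` within the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return events / window_seconds


def peak_rate(events_per_day: int, peak_to_average: float) -> float:
    """Peak events per second for a flow that must keep up with arrivals."""
    return required_rate(events_per_day, SECONDS_PER_DAY) * peak_to_average


EVENTS_PER_DAY = 10_000_000

# Pattern 1: overnight batch, assuming a 6-hour processing window.
batch_rate = required_rate(EVENTS_PER_DAY, 6 * 60 * 60)   # about 463 events/s

# Pattern 2: near-real-time, assuming peaks run at 5x the daily average.
live_peak = peak_rate(EVENTS_PER_DAY, 5.0)                # about 579 events/s
```

Under these assumptions the two rates land in the same range, which is the point: the volume is identical and the raw throughput is similar, but one flow has hours to finish and the other has seconds per event. That difference in deadline, not the event count, is what separates the two architectures.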

Security belongs here, not as a separate conversation but as a dimension of the scale analysis. The security posture you need for a customer-facing API is different from what you need for an internal service. Mission-critical paths deserve different controls than administrative ones. A 360-degree view means understanding all of these facets and making explicit trade-offs, not discovering the conflicts later under pressure.

// Stone 05

Optimisation vs. Disruption

Imagine someone asks you to build an engine capable of 300 kilometres per hour, and what you currently have is a steam engine. You have two paths. You can optimise the steam engine – better pistons, higher pressure, improved thermodynamics – and push it as far as it will go. Or you can set the steam engine aside and build a bullet train. Both paths are valid. Neither is automatically correct. The question is which choice is sustainable and future-proof given your constraints.

I think of it as a gradient, not a switch. Push the steam engine and you might reach eighty. A diesel locomotive gets you to a hundred and fifty. Only an electric train reaches three hundred – and the architect's real skill is knowing, honestly, which rung the requirement actually needs, before you spend a year optimising a machine that was never going to get there. That judgement is two things wearing one coat: pragmatism – will this genuinely serve the customer, even when it is not the fashionable choice? – and trade-off analysis – what does each path cost, and which of those costs can we live with?

Most teams underestimate how often they are in the bullet-train situation. They have spent years incrementally improving a system that was architected for a different era, and the accumulated optimisations have made it slower, not faster, to change. The question to ask is not "can we improve this?" – you almost always can. The question is "should we improve this, or should we rethink it entirely?"

The answer requires architectural imagination: the ability to design systems that can accommodate decisions not yet made. You will not know every future requirement when you begin building. But you can build seams – deliberate boundaries between components where the greatest uncertainty lives – so that when those future decisions arrive, they land in places that were designed to absorb them. That is not over-engineering. That is the difference between a system that ages well and one that calcifies.
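A seam can be as small as an interface placed where the uncertainty is highest. The Python sketch below borrows the search setting from the first stone, but every name in it (`SearchBackend`, `InMemorySearch`, `search_page`) is hypothetical and does not describe any system I actually built. Application code depends on the shape of the backend, so a different engine can be swapped in later without touching the callers.

```python
from typing import List, Protocol


class SearchBackend(Protocol):
    """The seam: callers depend on this shape, not on a concrete engine."""

    def search(self, query: str, limit: int) -> List[str]:
        ...


class InMemorySearch:
    """Hypothetical first implementation: naive substring match over a list."""

    def __init__(self, titles: List[str]) -> None:
        self._titles = titles

    def search(self, query: str, limit: int) -> List[str]:
        needle = query.lower()
        return [title for title in self._titles if needle in title.lower()][:limit]


def search_page(backend: SearchBackend, query: str) -> List[str]:
    """Application code written against the seam; it never names an engine."""
    return backend.search(query, limit=10)
```

With `InMemorySearch(["Red Kettle", "Blue Kettle", "Green Mug"])` as the backend, `search_page(backend, "kettle")` returns the two kettle titles. Replacing the backend with an index-based engine later would mean writing one new class with the same `search` method, and `search_page` would not change.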

the thread between them

What All Five Have in Common

Read through the five stones again and you will notice a single thread running through all of them: the deliberate separation of what you are building today from what the system needs to remain trustworthy five years from now. The right mindset accepts that the future is uncertain but worth planning for. Readable, testable code gives your future self the ability to change things without fear. Trackable errors and recoverable systems turn operational failure from a crisis into a process. Scale thinking by design means the system does not surprise you when the traffic arrives. And the optimisation-versus-disruption question forces you to look up from the day-to-day and ask whether you are on the right path at all.

None of these is a one-time decision. All five require ongoing attention, ongoing honesty, and ongoing courage – the courage to acknowledge when the current approach has run out of road and something fundamentally different is needed.

Systems that last are not lucky. They are built by people who chose, at every significant decision point, to think one step further than was strictly required.

The teams that build sustainable systems are not the ones who know the most. They are the ones who ask the right questions early enough that the answers still have room to shape the architecture.

That window does not stay open forever.