The Pipeline and the Cache: Forces 05–06

Chapter 11 of 18 Practitioner · 12 min

Writing the code feels like the finish line. It is closer to the starting gun – almost everything expensive happens after the commit lands. Forces 03 and 04 covered what comes before it: how code is structured across tracks, and how data is stored and retrieved. Forces 05 and 06 begin the moment that commit is in. Force 05 governs how five different artifact types travel from a developer's machine to production – and names the cross-artifact coordination problem that is genuinely unsolved in 2026. Force 06 governs the middleware carrying AI task payloads, where, in deployments I have seen, the gap between designing it well and ignoring it can come close to half the inference bill.

// the crux

Writing the code is the starting gun, not the finish line – almost everything expensive happens after the commit. The pipeline and the cache govern everything between commit and production, and a well-designed cache can cut close to half the inference bill.

// in one breath

Why a single feature now ships as five different artifact types at once – and the release-coordination problem nobody has cleanly solved yet.
Infrastructure as the dependency-zero artifact – the case for treating it as a pipeline, not a console you click through under pressure.
Where a large slice of an inference bill quietly hides: the middleware layer most teams still treat as a footnote.


// Companion overview

All 8 Forces Reshaping How Software Gets Built – reference card for the full landscape. This article covers Forces 05 and 06.

force 05

Force 05 / 08 · CI/CD & Build Orchestration · Unsolved

Five-Track Coordinated CI/CD Pipelines

Modern delivery produces five distinct artifact types: web, mobile, backend microservices, data pipelines, and AI/ML pipelines. Coordinating their build, test, and release as one coherent system – ensuring each artifact is valid before the next is promoted – is the industry's genuinely unsolved delivery problem in 2026.

For most of the last decade, "shipping" meant one thing: a container image went through CI, tests passed, and it landed in Kubernetes. One artifact type, one portable pipeline pattern – a largely solved problem. In 2026 a single feature routinely spans five artifact types with different build tools, test frameworks, deployment mechanisms, and rollback strategies – and the coordination between them is where engineering time now disappears.

The Five Artifact Types

Artifact 01

Web App

Next.js · Vercel · CDN

Fastest delivery. Deploy in seconds, roll back in seconds. No external gating.

Artifact 02

Mobile App

Fastlane · App Store · Play

The hard wall. 2–7 day review cycle regardless of how fast everything else moves. Cannot be accelerated.

Artifact 03

Backend Services

Docker · k8s · Helm

DDD microservices. Independently deployable but domain-coupled. Contract tests guard inter-service correctness.

Artifact 04

Data Pipelines

dbt Cloud · Airflow · Great Expectations

Schema migrations, transformation jobs, embedding generation. Failures here corrupt AI correctness downstream – silently.

Artifact 05

AI / ML Pipelines

MLflow · DVC · LangSmith

Model updates, vector index rebuilds, RAG pipeline changes. Validity depends on Artifact 04 being valid first.

One Feature, Five Builds

The practical consequence of five artifact types becomes visible the moment a feature crosses track boundaries – which AI features almost always do. Consider a common request: "Add semantic search to the product pages."

// Scenario: Add semantic search to the product pages

"Add semantic search to the mobile app and web – powered by our product catalogue."

Web

New search component, API calls to backend semantic endpoint. Build → test → deploy. Done in hours.

Mobile

New search screen, same API. Build → Fastlane → App Store submission.

⚠ 2–7 day review. The mobile release cannot run in parallel with the production rollout of the other tracks.

Backend

New semantic search service. Exposes vector retrieval endpoint consumed by both frontend tracks. Needs to be live before the web build ships.

Data Pipeline

Embedding generation job for the full product catalogue. Must complete and validate before the AI/ML pipeline has anything to serve. Schema contract with backend must hold.

AI / ML Pipeline

Vector index warm-up, similarity threshold tuning, RAG context configuration. Depends entirely on Data Pipeline artifact being valid and complete.

Each of these builds has passed its own tests. The failure mode is the dependency ordering: web ships before backend is live, the AI pipeline is promoted before the data pipeline validates, and mobile users get the feature a week after web users because no feature flag held the web launch back. All five artifacts were individually correct. The coordinated release was not.
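The ordering failure can be made checkable with very little code. What follows is a minimal sketch, not a release tool: the artifact names and the `RELEASE_DEPENDENCIES` map are assumptions drawn from the semantic search scenario, and `promotion_order` and `ordering_violations` are hypothetical helpers built on Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Assumed dependency map for the semantic search scenario: each key lists the
# artifacts that must be valid and live before it can be promoted.
RELEASE_DEPENDENCIES: dict[str, set[str]] = {
    "data_pipeline": set(),
    "ai_ml_pipeline": {"data_pipeline"},
    "backend": set(),
    "web": {"backend"},
    "mobile": {"backend"},
}


def promotion_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid promotion order (graphlib raises CycleError on a cycle)."""
    return list(TopologicalSorter(deps).static_order())


def ordering_violations(
    actual_order: list[str], deps: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """List (artifact, dependency) pairs where the artifact shipped first."""
    position = {name: index for index, name in enumerate(actual_order)}
    violations = []
    for artifact, required in deps.items():
        if artifact not in position:
            continue  # this artifact was not part of the release
        for dependency in required:
            shipped_at = position.get(dependency)
            if shipped_at is None or shipped_at > position[artifact]:
                violations.append((artifact, dependency))
    return sorted(violations)
```

Running `ordering_violations` against the order a release actually happened in turns "web shipped before backend" from a post-incident finding into a pre-release check.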

// The unsolved problem

Turborepo handles monorepo coordination within a track. GitHub Actions handles orchestration between tracks. But the cross-artifact dependency graph for coordinated release – ensuring the data pipeline artifact is valid before the AI/ML pipeline artifact is promoted, ensuring the backend is live before the web build ships – is genuinely unsolved at the tooling level in 2026. Teams that crack this compound their delivery velocity dramatically. Most are building custom orchestration.

What You Can Control Now

The coordinated release problem is genuinely hard. Three practices compress delivery today, without waiting for the tooling to catch up:

Define the dependency graph explicitly. Even without tooling that enforces it, documenting which artifacts must be valid before others are promoted turns an implicit failure mode into an explicit checklist. Most post-incident reviews reveal that everyone assumed someone else was checking the dependency order.

Treat the data pipeline as a first-class artifact. It has its own CI (dbt Cloud with data lineage), its own tests (Great Expectations, dbt tests), and its own deployment gate. The AI/ML pipeline should have an automated check: "is the upstream data pipeline artifact passing?" If not, the AI pipeline should not promote. This single constraint eliminates the largest category of silent AI correctness failures.

Feature flags for coordinated cross-track releases. The mobile review wall does not go away. But you can ship the backend and AI pipeline first, ship the web frontend with a flag, submit the mobile build, and flip the flag when the mobile review clears. The coordinated release is assembled in production rather than in the pipeline. This is not elegant – it is the pragmatic answer to a constraint that has no tooling solution yet.
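The second and third practices both reduce to a yes/no gate over artifact status. The sketch below assumes a hypothetical `ArtifactStatus` record that your own pipeline would populate; nothing here is a real CI or feature-flag API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactStatus:
    """Hypothetical status record reported by each track's pipeline."""
    name: str
    tests_passing: bool
    live_in_production: bool


def can_promote_ai_pipeline(data_pipeline: ArtifactStatus) -> bool:
    """Gate: the AI/ML pipeline promotes only if the upstream data checks pass."""
    return data_pipeline.tests_passing


def can_flip_flag(statuses: list[ArtifactStatus], required: set[str]) -> bool:
    """The flag flips only once every required track is live in production."""
    live = {status.name for status in statuses if status.live_in_production}
    return required <= live
```

With `required = {"backend", "ai_ml_pipeline", "web", "mobile"}`, `can_flip_flag` stays false while the mobile build is still in review, which is exactly the window the flag exists to cover.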

The Foundation Pipeline: Cloud Infrastructure

Before the first container can be deployed and before the first Kafka topic can be created, cloud infrastructure must exist. Most teams treat this as a one-time setup done by a senior engineer clicking through a cloud console. In 2026, infrastructure is a pipeline artifact with the same CI rights and responsibilities as application code – version-controlled, peer-reviewed, and applied through automation.

// Infrastructure as Code pipeline – Terraform / Terragrunt / Atlantis

Terraform + Terragrunt – provision + DRY multi-environment config

Terraform defines cloud resources declaratively. Terragrunt adds DRY configuration management across dev / staging / prod environments and multiple cloud accounts. Without Terragrunt, Terraform module proliferation across environments becomes the infrastructure equivalent of copy-paste programming – identical state backends, provider configs, and variable files repeated per environment and kept in sync by hand, which is how they drift.

PR → terraform plan → Atlantis review → apply on merge

Every infrastructure change goes through version control. terraform plan runs in CI on every PR – showing exactly which cloud resources will change before any engineer approves. Atlantis (or GitHub Actions with OIDC) then applies the approved change: non-production environments auto-apply, production requires an explicit approval gate. (Atlantis's default flow applies from a comment on the approved PR and merges afterwards; apply-on-merge is the usual shape of a GitHub Actions workflow.) The cloud console becomes read-only. If a resource is not in Git, it should not exist.
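To make "showing exactly which cloud resources will change" concrete, here is a small sketch that summarises the JSON produced by `terraform show -json` on a saved plan. It relies only on the `resource_changes` list and each entry's `change.actions` field; the function names and the environment label `"production"` are assumptions for illustration.

```python
import json


def summarize_plan(plan_json: str) -> dict[str, list[str]]:
    """Group resource addresses by planned action, skipping no-ops."""
    plan = json.loads(plan_json)
    summary: dict[str, list[str]] = {}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if actions == ["no-op"]:
            continue
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        summary.setdefault("+".join(actions), []).append(resource["address"])
    return summary


def requires_approval(environment: str) -> bool:
    """Assumed policy from the text: only production waits for a human."""
    return environment == "production"
```

Posting the summary as a PR comment gives reviewers the list of creates, updates, and replacements without reading the full plan output.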

Cloud-native IaC alternatives per provider

AWS CloudFormation / CDK (TypeScript-first infrastructure code, native integration with IAM and service limits), Azure Bicep (a declarative language that compiles to ARM templates, with cleaner syntax and first-class integration with Microsoft Entra ID, formerly Azure AD), GCP Deployment Manager / Config Connector (the latter offering Kubernetes-native resource management for GCP). The tooling choice matters less than the principle: every cloud resource is a code change that goes through CI.


Infrastructure is the dependency-0 artifact

EKS clusters, RDS instances, MSK Kafka brokers, VPC configurations, and IAM roles must be provisioned before any other pipeline artifact can deploy to them. Including the infrastructure pipeline in the Force 05 coordinated release model resolves a specific category of failure: application pipelines succeed, container images are built and tested, but there is nothing in the cloud to deploy to – or the resource exists but with the wrong configuration, discovered at 2am.

↳ industry alignment · AWS AI-DLC – its Operations phase treats IaC and deployment as AI-managed using context accumulated from earlier phases, with team oversight – the same dependency-0 principle.

// The AI-generated pipeline config opportunity

One place Force 05 does benefit from AI directly: pipeline configuration generation. GitHub Actions workflows, Helm chart scaffolding, Terraform module composition – these are high-structure, low-context tasks where AI produces reliable output. The caution: AI-generated pipeline config should be reviewed by someone who understands what it does, not just whether it runs. A pipeline that deploys in the wrong order or skips a contract test silently is worse than one that fails loudly.

// Force 05 tools · 2026

GitHub Actions · Turborepo / Nx · Fastlane · dbt Cloud · MLflow / DVC · Buildkite · ArgoCD · Terraform · Terragrunt · Atlantis

force 06

Force 06 / 08 · Middleware & Integration · Underrated

Intelligent Middleware: Semantic Cache + Event Intelligence

Redis is no longer just a cache. Queues are no longer just buffers. In AI-augmented systems, the middleware layer has become the most important cost and latency lever – and consistently the least invested-in part of the architecture.

Force 06 is the most underrated force in the series – less discussed than vector databases and agent orchestration, and rarely on the architecture diagram in early design sessions. Yet at meaningful scale, the middleware design decision can be worth more than any model optimisation you will do.

The 40–70% Lever

40–70%

The range of LLM inference cost reduction reported in enterprise semantic-caching deployments – and consistent with what I have seen. The exact number varies by workload; the mechanism does not. At the volume of queries a mid-size enterprise generates – support tickets, internal search, document summarisation, code review assistance – a significant percentage of queries are semantically equivalent even when the literal text differs. Semantic caching deduplicates them by meaning, not exact string match. Most teams are not doing this. They are paying per call.

↳ see also · Chapter 14 – AI Is Not Free – what that 40–70% looks like on a real enterprise and startup bill.

TTL Cache vs Semantic Cache

// What most teams do

TTL-Based Caching

Cache LLM responses by exact request hash. Expire after N minutes. A request for "summarise this contract" and a request for "give me a summary of this contract" are two separate LLM calls, cached separately and billed separately.

Cache hit requires: exact same string → exact same cache key. Hit rate at production query diversity: low.

// What winning teams do

Semantic Caching

Embed the query. Retrieve cached responses whose query embeddings are similar above a threshold. "Summarise this contract" and "give me a summary of this contract" resolve to the same cached response if the similarity between their embeddings is above that threshold.

Cache hit requires: semantic similarity above threshold → reuse cached response. Hit rate at production query diversity: in well-tuned deployments, 40–70% of calls never reach the model.
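The difference between the two approaches is easiest to see in the lookup path. Below is a minimal sketch of a semantic cache, assuming an `embed` function you supply (any embedding model); `SemanticCache` and its linear scan are illustrative only, and a real deployment would use a vector index rather than a Python list.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Reuses a cached response when a new query is close enough in meaning."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed            # assumed: caller-supplied embedding function
        self.threshold = threshold    # similarity needed to count as a hit
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

A caller checks `get` first and only calls the model, then `put`, on a miss. The threshold is the one parameter that trades savings against the risk of reusing an answer to a question that was not quite the same.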

The invalidation challenge is real. A semantic cache entry is not stale based on time – it is stale when the underlying data that informed the response changes. For static content (documentation, policies, product specs), invalidating on document update is enough. For dynamic content, the invalidation logic is part of the cache design and must be defined upfront – teams that defer it consistently get it wrong in production.
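For the static-content case, invalidation on document update can be sketched as a reverse index from source documents to the cache entries they informed. `InvalidationIndex`, the entry IDs, and the document IDs are all hypothetical; the cache itself is assumed to support eviction by entry ID.

```python
from collections import defaultdict


class InvalidationIndex:
    """Maps each source document to the cache entries built from it."""

    def __init__(self) -> None:
        self._entries_by_document: dict[str, set[str]] = defaultdict(set)

    def register(self, entry_id: str, source_document_ids: list[str]) -> None:
        """Record which documents informed a cached response."""
        for document_id in source_document_ids:
            self._entries_by_document[document_id].add(entry_id)

    def on_document_updated(self, document_id: str) -> set[str]:
        """Return the entry IDs to evict now that this document has changed.

        An evicted entry may still be listed under its other source documents;
        evicting it a second time later is assumed to be a harmless no-op.
        """
        return self._entries_by_document.pop(document_id, set())
```

The design point is that the mapping has to be written at cache-fill time. Reconstructing which documents informed a response after the fact is generally not possible.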

Queue Design for AI Workloads

The second half of Force 06 is less visible but equally structural. When queues carry AI task payloads – not just simple message bodies, but agent instructions, conversation context, tool call sequences, and multi-step workflow state – the standard queue design assumptions break down.

Payload

AI tasks carry context, not just data. The payload includes the task instruction, relevant prior conversation state, tool permissions, and cost budget. Standard job queues were designed for lightweight serialised objects – not 4KB+ context payloads with structured metadata.

Priority

Real-time and batch AI tasks have fundamentally different latency requirements and cost profiles. A user-facing summarisation request (sub-2-second expectation) and a nightly document re-embedding job should not compete for the same queue depth. Priority lanes with separate consumer pools are a design requirement, not an optimisation.

Retry logic

Standard retry: wait and retry the same message. AI task retry: the failure context may have changed the optimal retry approach. A failed agent task that timed out on a complex step should retry with a simplified instruction, not an identical one. Retry-with-context is a design pattern most queue implementations require custom middleware to support.

Result cache

Completed AI task results should be cached at the queue level, and the consumer should check that cache before it processes a new task. If two different upstream services enqueue semantically equivalent tasks within a short window, only one LLM call should be made. This is queue-level deduplication by semantic similarity – the intersection of Force 06's two ideas.
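The payload, priority, and result-cache points above can be sketched together in a few lines. Everything named here is an assumption for illustration: `AITask` is a hypothetical envelope, `user_facing` is an invented flag for choosing a lane, and `find_cached_result` and `call_model` stand in for your own semantic lookup and model client. In-memory deques stand in for real queues.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AITask:
    """Envelope carrying context alongside the instruction."""
    instruction: str
    conversation_state: list[str] = field(default_factory=list)
    tool_permissions: list[str] = field(default_factory=list)
    cost_budget_usd: float = 0.0
    user_facing: bool = False  # hypothetical flag used to pick a lane


class LaneRouter:
    """Keeps real-time and batch tasks in separate queues (and consumer pools)."""

    def __init__(self) -> None:
        self.lanes: dict[str, deque] = {"realtime": deque(), "batch": deque()}

    def enqueue(self, task: AITask) -> str:
        lane = "realtime" if task.user_facing else "batch"
        self.lanes[lane].append(task)
        return lane


def process(
    task: AITask,
    find_cached_result: Callable[[str], Optional[str]],
    call_model: Callable[[AITask], str],
) -> str:
    """Consumer step: reuse an equivalent result if one exists, else pay for a call."""
    cached = find_cached_result(task.instruction)
    if cached is not None:
        return cached
    return call_model(task)
```

Retry logic is left out here on purpose, since a retry that changes the instruction needs the failure context, which is a separate concern from routing.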

// The dead letter queue insight

Standard DLQ thinking: failed messages are a problem to resolve. For AI tasks, the failure context is often more valuable than the retry. The context that caused the failure – the specific task instruction, the context window state, the tool call that timed out – directly informs how the next attempt should be structured. Teams that instrument their DLQs for AI tasks and surface the failure context to the next attempt see significantly higher resolution rates than teams that retry identically. The DLQ is not a dead end. It is a prompt engineering dataset.
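One way to surface failure context to the next attempt is to build the retry instruction from the dead-lettered record rather than replaying the original message. This is a sketch under assumptions: `FailedTask` is a hypothetical DLQ record, and the wording appended to the instruction is one possible simplification strategy, not a recommended prompt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailedTask:
    """Hypothetical record captured when a task lands in the dead letter queue."""
    instruction: str
    failed_step: str
    error: str
    attempts: int


def build_retry(failed: FailedTask, max_attempts: int = 3) -> Optional[str]:
    """Return a revised instruction for the next attempt, or None to give up."""
    if failed.attempts >= max_attempts:
        return None  # leave it in the DLQ for human review
    return (
        f"{failed.instruction}\n\n"
        f"A previous attempt failed at step '{failed.failed_step}' "
        f"with error: {failed.error}. "
        "Simplify or skip that step and report what was skipped."
    )
```

Storing the `FailedTask` records, not just counting them, is what makes the DLQ usable as a dataset later.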

Temporal for Durable Workflows

Multi-step agent workflows add a problem neither standard queues nor async/await patterns handle well: surviving infrastructure failures mid-execution. If an agent is partway through a 12-step document processing task and the worker crashes, the default outcome is re-running from step one – paying again for every step already completed.

Temporal addresses this directly: it provides durable workflow execution with guaranteed progress, activity retries with configurable policies, workflow versioning for long-running tasks, and visibility into workflow state at any point. For AI agent orchestration that involves expensive LLM calls, external API calls, and human review gates, Temporal is the infrastructure choice that makes multi-step reliability possible without building it from scratch.
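The core idea of resuming from recorded progress can be illustrated without any workflow engine. The sketch below is plain Python and deliberately not the Temporal SDK: `run_workflow` is a hypothetical helper, and the `checkpoint` dict stands in for the durable history a real engine would persist.

```python
from typing import Callable


def run_workflow(
    steps: list[tuple[str, Callable[[], str]]],
    checkpoint: dict[str, str],
) -> dict[str, str]:
    """Run named steps in order, skipping any whose result is already recorded."""
    for name, step in steps:
        if name in checkpoint:
            continue  # completed before the crash, so do not pay for it again
        # A real system would write this result to durable storage before
        # moving on; here the dict passed in by the caller plays that role.
        checkpoint[name] = step()
    return checkpoint
```

If step 7 of 12 raises, the first six results are already in `checkpoint`, and calling `run_workflow` again with the same dict resumes at step 7. Retries, versioning, and visibility are the parts this sketch leaves out and an engine provides.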

// The design implication

Cache strategy and queue design are first-class architectural decisions in AI systems. Most teams still treat them as infrastructure footnotes – something the platform team handles. The gap between those two positions can be close to half your inference bill, plus the cost of every multi-step agent workflow that fails mid-execution and restarts from step one.

// Force 06 tools · 2026

Redis + GPTCache · BullMQ · Kafka · Temporal · Dapr · Upstash · RabbitMQ

the connection

Force 05 and Force 06 Are the Same Layer

Both forces govern what happens between the developer and the user: Force 05 the delivery path from commit to production, Force 06 the execution path – how AI task payloads move, how responses are cached, how failures are handled.

They interact directly. A Force 05 coordinated release of a new AI pipeline artifact requires the semantic cache to be invalidated for affected response types – otherwise the new model behaviour is masked by cached responses from the previous version. A Force 06 queue priority lane design requires knowing which artifact types are user-facing (real-time) and which are background (batch) – information that comes from the Force 05 delivery model.

Teams that design Force 06 in isolation from Force 05 – that add semantic caching as a performance optimisation without connecting it to the deployment pipeline – consistently discover the missing link the first time they ship a model update and users receive cached responses from the old version for hours.
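One way to connect the cache to the release is to namespace entries by response type and model version, so that promoting a new version cannot serve the old version's answers. This is a sketch with assumed names: `VersionedCache` is hypothetical, exact-key lookup stands in for the semantic lookup, and the version strings would come from your release pipeline.

```python
from typing import Optional


class VersionedCache:
    """Cache whose entries are scoped to a (response type, model version) pair."""

    def __init__(self) -> None:
        self._namespaces: dict[tuple[str, str], dict[str, str]] = {}

    def put(self, response_type: str, model_version: str, key: str, response: str) -> None:
        self._namespaces.setdefault((response_type, model_version), {})[key] = response

    def get(self, response_type: str, model_version: str, key: str) -> Optional[str]:
        return self._namespaces.get((response_type, model_version), {}).get(key)

    def retire(self, response_type: str, model_version: str) -> int:
        """Drop a whole namespace; intended as a release-pipeline step.

        Returns how many entries were evicted.
        """
        return len(self._namespaces.pop((response_type, model_version), {}))
```

Reads against the new version simply miss until the cache refills, and `retire` on the old version becomes an explicit step in the coordinated release rather than something remembered afterwards.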

// try it – a 20-minute lab

Measure a Semantic Cache Hit Rate

Collect thirty real queries from your logs or support inbox – or write three phrasings each of ten questions.
Embed each query; cache model responses keyed by embedding.
Set the similarity threshold to 0.92 and replay the set, counting cache hits versus model calls.
Lower the threshold to 0.85, replay again, and count how many reused answers are now subtly wrong.
Multiply the saved calls by your per-call cost – that is the lever at your volume.

// what you'll learn: the threshold is not a tuning detail – it is the product decision, traded between savings and correctness, made visible in twenty minutes.
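The replay in the lab can be sketched as a single function. It assumes you supply `embed` (your embedding model) and `cost_per_call` (your own per-call figure); `replay` is a hypothetical helper that counts a hit whenever a query is within the threshold of a previously missed query.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def replay(
    queries: list[str],
    embed: Callable[[str], Vector],
    threshold: float,
    cost_per_call: float,
) -> dict[str, float]:
    """Replay queries in order and count cache hits versus model calls."""
    cached_vectors: list[Vector] = []
    hits = 0
    for query in queries:
        vector = embed(query)
        if any(cosine_similarity(vector, c) >= threshold for c in cached_vectors):
            hits += 1
        else:
            cached_vectors.append(vector)  # a miss: the model is called and cached
    total = len(queries)
    return {
        "hits": hits,
        "model_calls": total - hits,
        "hit_rate": hits / total if total else 0.0,
        "saved_cost": hits * cost_per_call,
    }
```

Run it once at 0.92 and once at 0.85 and compare the two results. The function only counts hits; judging how many of the extra hits at 0.85 reuse a subtly wrong answer still means reading them by hand.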

Forces 05 and 06 bring the delivery and runtime layers into the architecture review. Both ask the same underlying question: what does the system need to do between the moment code is written and the moment a user receives a response?

// carry forward

The final two forces ask the inverse: how do you know the answer the user received was correct – and what happens to the system when it is not? That is Chapter 12: the eval and the runbook.

// tool references last reviewed · 2026