The Eval and the Runbook: Forces 07–08

Chapter 12 of 18 · Practitioner · 12 min

Forces 05 and 06 bring artifacts to production and route AI task payloads through the running system. Forces 07 and 08 start where the monitoring dashboard used to be enough. Force 07 adds the fourth layer to the test pyramid that few teams have built yet – AI evaluation. Force 08 asks what SRE looks like when the failure mode is "confidently wrong" rather than "unavailable" – and when the alert that should fire does not, because the system is behaving exactly as designed.

// the crux

The test pyramid needs a fourth layer – AI evaluation – and SRE needs a new failure mode: "confidently wrong" instead of "unavailable." The cascade is the point: each force is the ceiling for the next. You can only be as reliable as you can measure.

// in one breath

The fourth layer the test pyramid is still missing – the one that catches an AI feature being confidently wrong before your users do.
Why the dashboards stay green while an AI system quietly fails – and the new SLOs that can finally see it.
The uncomfortable truth about autonomous agents: some actions cannot be undone, which makes prevention the whole game.


// Companion overview

All 8 Forces Reshaping How Software Gets Built – reference card for the full landscape. This article covers Forces 07 and 08.

force 07

Force 07 / 08 · Quality & Testing · Critical Gap

The Unified Test Suite: Functional + Non-Functional + AI Eval

The traditional test pyramid has three layers. AI systems need a fourth – AI evaluation – which checks whether your AI feature is producing correct outputs, not just whether it is running. Adoption of this fourth layer in 2026 is low. Ownership is unclear. The failure mode is silent.

The traditional test pyramid was built for a world where software behaved deterministically. Given the same input, the same output emerged. Testing meant specifying inputs, asserting outputs, and confirming the gap between the two was zero. This is still true for the deterministic parts of your system. It is not sufficient for the AI parts.

A language model given the same input can produce semantically similar but textually different outputs across runs. Its response can be factually incorrect, contextually appropriate, confidently delivered, and completely undetected by every existing test in your suite. None of your current quality tooling was designed for this class of failure. Most teams are shipping AI features with a testing gap large enough to affect real users for days before anyone notices.

The Four-Layer Test Suite

Layer 1 · Functional Tests – Unit · Integration · Contract
Tools: Vitest · Playwright · Pact · Jest

Layer 2 · Non-Functional Tests – Performance · Security · Load
Tools: k6 · Locust · Snyk · OWASP ZAP

Layer 3 · Data Quality Tests – Schema · Freshness · Pipeline
Tools: Great Expectations · dbt tests · Monte Carlo

Layer 4 · AI Evaluation – Faithfulness · Relevance · Hallucination Rate
Tools: Ragas · DeepEval · LangSmith

Layers 3 and 4 are the ones most teams are missing in 2026.

Layers 1 and 2 are widely adopted and well-understood. Layer 3 – data quality testing – is implemented by teams with mature data engineering practices and largely absent everywhere else. Layer 4 – AI evaluation – is understood conceptually, has good tooling, and has almost no production adoption. Most teams that ship AI features have never run a single automated AI evaluation test in CI.

Sizing the Functional Matrix: Orthogonal Arrays

Before the new layer, one correction to the oldest one. Layer 1 has a sizing problem that predates AI entirely. A feature with four configuration parameters, each taking three values, has 81 possible combinations. Nobody writes 81 test cases, so most teams sample by instinct, and coverage becomes whatever the last engineer felt like that afternoon. There has been a standard answer since the 1980s: orthogonal array testing, borrowed from Taguchi's design of experiments, derives the minimal balanced set of cases that still covers every pairwise interaction. For the same four parameters and three values, the array needs nine tests. Nine, not 81, and every pair of parameter values still appears together at least once.
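The nine-case claim can be checked mechanically. The sketch below builds the classical nine-row array for four three-level parameters using modular arithmetic and then verifies the pairwise property. The parameter names and values are hypothetical placeholders, not taken from any real feature.

```python
from itertools import combinations, product

# Hypothetical parameter table: four settings, three values each (3**4 = 81 combinations).
PARAMETERS = {
    "browser": ["chrome", "firefox", "safari"],
    "locale": ["en", "de", "ja"],
    "plan": ["free", "team", "enterprise"],
    "theme": ["light", "dark", "system"],
}


def l9_array():
    """Return the nine rows of a four-column, three-level orthogonal array as level indices."""
    # Columns 3 and 4 are linear combinations of the first two, taken modulo 3.
    return [(a, b, (a + b) % 3, (a + 2 * b) % 3) for a in range(3) for b in range(3)]


def covers_all_pairs(rows, levels=3):
    """True if every pair of columns shows every pair of levels at least once."""
    wanted = set(product(range(levels), repeat=2))
    return all(
        {(row[i], row[j]) for row in rows} == wanted
        for i, j in combinations(range(len(rows[0])), 2)
    )


def to_test_cases(rows, parameters):
    """Map level indices back to concrete parameter values."""
    names = list(parameters)
    return [
        {name: parameters[name][level] for name, level in zip(names, row)}
        for row in rows
    ]


cases = to_test_cases(l9_array(), PARAMETERS)
```

Here `covers_all_pairs(l9_array())` holds and `len(cases)` is 9, which is the property a reviewer would check before trusting the matrix.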

The empirical backing is strong. NIST's software fault studies found that, in several of the systems examined, a majority of failures were triggered by a single parameter value, and up to 97 percent by the interaction of at most two, which is why pairwise coverage catches most of what exhaustive coverage would at a fraction of the count. Modern practice generalises the classical arrays into covering arrays (NIST's ACTS tool builds them for arbitrary parameter counts and strengths), but the discipline is unchanged: the test matrix is derived from the input structure, not improvised from habit.
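For parameters with unequal numbers of values, where no classical array fits, a covering array can still be derived. What follows is a deliberately naive greedy sketch, not the algorithm ACTS uses: it repeatedly picks the full combination that covers the most still-uncovered value pairs. The example parameters are hypothetical.

```python
from itertools import combinations, product


def pairwise_cases(parameters):
    """Greedily choose full combinations until every pair of values is covered."""
    names = list(parameters)
    candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

    def pairs_of(case):
        # Every (parameter, value) pair that this single test case exercises together.
        return {((a, case[a]), (b, case[b])) for a, b in combinations(names, 2)}

    uncovered = set().union(*(pairs_of(c) for c in candidates))
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


# Hypothetical mixed-level table: 3 * 2 * 3 * 2 = 36 exhaustive combinations.
MIXED = {
    "browser": ["chrome", "firefox", "safari"],
    "os": ["linux", "macos"],
    "locale": ["en", "de", "ja"],
    "plan": ["free", "team"],
}
mixed_cases = pairwise_cases(MIXED)
```

The greedy result is not guaranteed to be minimal, but it is always pairwise-complete and far smaller than the exhaustive product. It scans the full product on every step, so it suits small parameter tables only.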

The reason it belongs in this chapter: an orthogonal array is a computable standard, which makes it a natural scope for an agent. Hand a model the parameter table and the coverage strength, and it derives the matrix and generates the cases, plus the boundary cases the array deliberately leaves out. Test case selection stops being taste and becomes a deliverable a reviewer can check against a published method. That is the same move Chapter 6 argues for every phase: scoped, not constrained.

What Nothing Is Telling You

The last three rows are where your users are living right now if you have shipped an AI feature without Layer 4. The system is healthy by every metric your monitoring tracks. Users are receiving incorrect, outdated, or hallucinated responses. The symptom – bad AI output – and the cause – a data pipeline failure, a stale embedding index, a retrieval precision threshold that was never defined – are invisible to each other.

↳ see also · Chapter 6 – AI Adoption Is Not a Tooling Problem – healthy dashboards, wrong answers: this blind spot as a process failure.

// The critical gap

AI features are downstream of data pipelines. Data quality issues surface as AI quality issues – silently, in production – affecting real users for days before anyone connects the symptom to the cause. The data quality test suite is not a data engineering concern. It is an AI correctness concern. These two things belong in the same CI pipeline, owned by the same team, gating the same deployment.
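One way to picture "same pipeline, same gate" is a single deployment decision that refuses to pass unless both layers actually ran and passed. This is a minimal sketch: the `CheckResult` shape, the layer names, and the rule that a layer with no results blocks the deploy are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: both layers must report before a deploy is allowed.
REQUIRED_LAYERS = ("data_quality", "ai_eval")


@dataclass(frozen=True)
class CheckResult:
    name: str     # e.g. "freshness" or "faithfulness"
    layer: str    # "data_quality" or "ai_eval"
    passed: bool


def deployment_gate(results):
    """Return (allowed, reasons); any failed or missing layer blocks the deploy."""
    reasons = [f"{r.layer}:{r.name} failed" for r in results if not r.passed]
    present = {r.layer for r in results}
    reasons += [
        f"{layer}: no checks ran" for layer in REQUIRED_LAYERS if layer not in present
    ]
    return (not reasons, reasons)
```

With this shape, a stale-data failure and a faithfulness regression show up in the same list of reasons, on the same pull request.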

The AI Evaluation Layer in Practice

The primitives for AI evaluation exist and are production-ready. Ragas and DeepEval provide the core metrics. LangSmith provides the tracking infrastructure and golden dataset management. The tooling is not the gap – the adoption decision is.

| Metric | What it measures | Tool |
| --- | --- | --- |
| Faithfulness | Does the response stay within the retrieved context? Or is the model adding content not present in the source? | Ragas |
| Relevance | Is the response actually answering the question asked? High relevance means the answer addresses the query; low relevance means it responds to a different question. | Ragas · DeepEval |
| Context Precision | Of the chunks retrieved for RAG, what fraction were actually relevant to the query? Low precision means noise is contaminating the context. | Ragas |
| Hallucination Rate | What percentage of responses contain statements that cannot be grounded in the context or source data? Tracked over time to detect model or data drift. | DeepEval · LangSmith |
| Context Recall | Was the relevant information in the corpus actually retrieved? Low recall means correct answers exist in your data but the retrieval system is not finding them. | Ragas |
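The two retrieval metrics and the hallucination rate reduce to simple ratios once relevance and grounding have been labelled. The sketch below is only that arithmetic, on hand-labelled data. It is not how Ragas or DeepEval compute their scores (those tools use model-based judging and their own weighting), and the chunk ids are made up.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved chunks that were actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant chunks that the retriever actually found."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def hallucination_rate(grounding_labels):
    """Share of responses containing at least one ungrounded claim.

    grounding_labels holds one list per response; each entry is True when
    that claim could be grounded in the context, False otherwise.
    """
    if not grounding_labels:
        return 0.0
    flagged = sum(1 for claims in grounding_labels if not all(claims))
    return flagged / len(grounding_labels)


# Hypothetical labelled example: four chunks retrieved, three relevant in the corpus.
precision = context_precision(["c1", "c2", "c3", "c4"], ["c2", "c4", "c9"])  # 2 of 4
recall = context_recall(["c1", "c2", "c3", "c4"], ["c2", "c4", "c9"])        # 2 of 3
```

Low precision with high recall points at noisy retrieval; high precision with low recall points at content the retriever never reaches.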

// The golden dataset problem

AI evaluation requires curated golden datasets: representative input–output pairs that define correct behaviour for your domain. These are not one-time artefacts – they require active maintenance. When the underlying model is updated, when the data the RAG system retrieves from changes, when the domain evolves, the golden dataset must be reviewed and updated to remain meaningful. Ownership of the golden dataset – which team maintains it, who defines "correct" for edge cases, how it is version-controlled – is an organisational question that must be answered before the technical tooling can do its job. Teams that skip this answer learn it the hard way: their AI eval CI job passes every PR because the golden dataset is two model versions out of date.
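The "two model versions out of date" failure is cheap to detect if the golden dataset carries a manifest and CI compares it with what is being deployed. A minimal sketch follows, with every field name and the 90-day review window chosen purely for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenDatasetManifest:
    version: str
    owner: str                    # team accountable for what counts as "correct"
    reviewed_on: date
    reviewed_against_model: str   # model version the pairs were last checked with
    reviewed_against_corpus: str  # snapshot id of the retrieval corpus


def staleness_problems(manifest, deployed_model, deployed_corpus, today, max_age_days=90):
    """List the reasons this golden dataset should not be trusted as a gate."""
    problems = []
    if not manifest.owner:
        problems.append("no owner recorded")
    if manifest.reviewed_against_model != deployed_model:
        problems.append(
            f"reviewed against model {manifest.reviewed_against_model}, deploying {deployed_model}"
        )
    if manifest.reviewed_against_corpus != deployed_corpus:
        problems.append(
            f"reviewed against corpus {manifest.reviewed_against_corpus}, deploying {deployed_corpus}"
        )
    if (today - manifest.reviewed_on).days > max_age_days:
        problems.append(f"last reviewed more than {max_age_days} days ago")
    return problems
```

An empty list means the eval result is meaningful; anything else means a green eval job is only confirming an outdated definition of correct.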

↳ industry alignment · AWS AI-DLC – left-shifts quality the same way: steering files carry org standards and compliance rules into every agent task, and its financial-services variant adds business-intent-to-code traceability for auditors.

// Force 07 tools · 2026

Ragas · DeepEval · LangSmith · Great Expectations · dbt tests · Vitest / Playwright / Pact · k6 / Locust · Snyk / OWASP ZAP

force 08

Force 08 / 08 · Reliability & Platform Engineering · Frontier

AI-Aware SRE: When Runbooks Meet Hallucination

SRE runbooks were designed for deterministic systems. AI systems are not deterministic. The on-call engineer in 2026 needs a new category of tool, a new category of SLO, and a new category of playbook – for failures that don't fire alerts.

Site Reliability Engineering was built on a foundational assumption: a healthy system is one where latency is within bounds, error rate is below threshold, and throughput is within capacity. Define those bounds, alert on breach, escalate, fix. The runbook captures what to do when any of those three numbers goes wrong.

AI systems introduce a failure mode that none of those three numbers captures: the system is fast, reliable, and operating at normal throughput – and it is giving users wrong answers. There is no alert for "confident incorrectness." There is no SLO for "degraded reasoning quality." The runbook has no entry for "the model started hallucinating entity names in user-visible summaries three days ago and nobody knew."

// The 3–6 month pattern

This failure pattern hits most teams shipping AI features – typically 3 to 6 months after launch. In the first weeks, the team monitors closely, quality is high, edge cases are rare. As time passes, novelty wears off, monitoring relaxes, and edge cases accumulate in production. The model encounters inputs it was not well-evaluated against. The embedding index has drifted from the live data. A configuration change altered retrieval behaviour without a corresponding eval run. By the time a user complaint surfaces, the degradation has been ongoing for weeks. The incident review cannot identify the start date because nothing in the monitoring recorded it.

New SLO Vocabulary for AI Systems

The answer is not more dashboards. It is defining what "healthy" means for an AI system – the equivalent of latency and error rate SLOs, applied to AI behaviour. These SLOs are not standard; every organisation must define them from the characteristics of their domain. But the categories are consistent:

// SLO type 01

Hallucination Rate SLO

The maximum acceptable percentage of responses that contain claims not grounded in the retrieved context or authoritative source data. Typically set per feature tier: higher tolerance for internal tools, lower for user-facing answers.

// Measured by: Ragas faithfulness + DeepEval hallucination scorer on production samples
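Per-tier tolerance means the same measured rate can be acceptable for one feature and a breach for another. A small sketch, assuming the flagged count comes from whatever scorer you run over production samples; the tier names and ceilings are illustrative, not recommendations:

```python
# Hypothetical ceilings per feature tier, as a fraction of sampled responses.
HALLUCINATION_CEILING = {"internal": 0.05, "user_facing": 0.01}


def hallucination_slo(tier, sampled, flagged):
    """Compare the sampled hallucination rate with the ceiling for this tier."""
    if sampled <= 0:
        raise ValueError("no production samples scored; the SLO cannot be evaluated")
    rate = flagged / sampled
    ceiling = HALLUCINATION_CEILING[tier]
    return {"tier": tier, "rate": rate, "ceiling": ceiling, "breached": rate > ceiling}


# Eight flagged responses out of 400 sampled is a 2% rate.
internal = hallucination_slo("internal", sampled=400, flagged=8)
user_facing = hallucination_slo("user_facing", sampled=400, flagged=8)
```

At a 2% rate the internal tool is within its SLO and the user-facing feature is not. Raising an error on an empty sample is deliberate: "nothing was measured" should never read as "healthy".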

// SLO type 02

Retrieval Precision Floor

The minimum acceptable ratio of relevant-to-retrieved chunks in your RAG pipeline. Below this floor, the AI is generating from noisy context and output quality will degrade regardless of model capability. Alert before it affects users.

// Measured by: Ragas context precision on a rolling sample of production queries
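"Alert before it affects users" implies a warning band above the floor, evaluated on a rolling window rather than on single queries. A minimal sketch of that idea; the floor, margin, and window size are assumptions to tune per domain:

```python
from collections import deque


class PrecisionFloorMonitor:
    """Rolling mean of context-precision scores with a warning band above the floor."""

    def __init__(self, floor, warn_margin, window):
        self.floor = floor
        self.warn_margin = warn_margin
        self.scores = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, score):
        """Add one sampled score and return the current status."""
        self.scores.append(score)
        return self.status()

    def status(self):
        if not self.scores:
            return "no_data"
        mean = sum(self.scores) / len(self.scores)
        if mean < self.floor:
            return "breach"
        if mean < self.floor + self.warn_margin:
            return "warn"
        return "ok"
```

With `PrecisionFloorMonitor(floor=0.6, warn_margin=0.1, window=3)`, a run of scores 0.9, 0.9, 0.9, 0.6, 0.5, 0.4 moves from "ok" to "warn" to "breach", so the warning arrives one sample before the floor is crossed.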

// SLO type 03

Agent Action Audit Rate

The percentage of autonomous agent actions that are logged with full traceability: task instruction, context window state at decision time, tool call made, result received. 100% is the only acceptable audit rate – partial traceability means a partial audit trail, and an untraced action cannot be reconstructed or remediated after the fact.

// Measured by: Langfuse / LangSmith trace completeness per agent session
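Trace completeness is straightforward to compute once the four required elements are named. The sketch below assumes each agent action is exported as a plain dictionary; the field names are hypothetical and do not correspond to any tracing vendor's schema.

```python
# The four elements named above, under hypothetical field names.
REQUIRED_FIELDS = ("instruction", "context_snapshot", "tool_call", "result")


def is_fully_traced(action):
    """True only when every required field is present and non-empty."""
    return all(action.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def audit_rate(actions):
    """Fraction of actions with complete traces (1.0 when there were no actions)."""
    if not actions:
        return 1.0
    return sum(1 for a in actions if is_fully_traced(a)) / len(actions)


def untraced_action_ids(actions):
    """Ids of the actions that would be impossible to reconstruct later."""
    return [a.get("action_id") for a in actions if not is_fully_traced(a)]
```

Because the target is 100%, the useful output is usually the list of offending ids rather than the rate itself.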

// SLO type 04

Cost-Per-Delivered-Value

The LLM inference cost attributable to each unit of delivered value – a completed task, a resolved query, a summarised document. Rising cost-per-value (not total cost) is the early warning signal of degrading retrieval efficiency or model misuse patterns before they become incidents.

// Measured by: token cost per completed task, tracked by feature and model version
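The distinction between total cost and cost-per-value is easiest to see in the arithmetic: failed attempts add to the numerator but not the denominator. A sketch under assumed event fields (`feature`, `model_version`, `cost_usd`, `completed`), all hypothetical:

```python
from collections import defaultdict


def cost_per_completed_task(events):
    """Inference cost per completed task, keyed by (feature, model_version)."""
    cost = defaultdict(float)
    completed = defaultdict(int)
    for event in events:
        key = (event["feature"], event["model_version"])
        cost[key] += event["cost_usd"]  # failed attempts still cost money
        if event["completed"]:
            completed[key] += 1
    # No completed tasks at all is reported as infinite cost per unit of value.
    return {
        key: (cost[key] / completed[key] if completed[key] else float("inf"))
        for key in cost
    }
```

Two model versions with identical per-call cost can differ several-fold on this measure if one of them needs retries to finish a task, which is the drift the SLO is meant to surface.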

// SLO type 05

Business Outcome SLO

Did the AI feature achieve its business purpose? Task completion rate (did the agent successfully resolve the request?), user satisfaction on AI-assisted interactions, and AI-assisted resolution rate. These are distinct from Cost-Per-Delivered-Value – they measure whether value was actually delivered, not what it cost to deliver it. A low-cost workflow that produces wrong answers fails this SLO despite passing the cost SLO.

// Measured by: product analytics + user feedback signals + agent task outcome logs (Amplitude, Mixpanel, custom event telemetry)

// The on-call reality

On-call for an AI system in 2026 requires different debugging skills than traditional SRE. Latency profiles are normal. Error rates are normal. The investigation starts with: what changed in the data pipeline? When was the embedding index last rebuilt? Did the golden dataset evaluation pass on the last deployment? None of these questions are in a traditional runbook. Building the runbook for AI failure modes is the work of Force 08 – and most teams have not started it.

Autonomous Action Rollback: The Hard Constraint

⚠

// The one-way door problem

Traditional SRE assumes rollback is possible: revert the deployment, restore the database snapshot, re-route traffic to the previous version. Autonomous agent actions often cannot be rolled back. An email sent is sent. An API call to an external service has completed. A payment has been processed. A record has been deleted. The SRE principle of reversibility – which underpins almost all traditional incident response – does not apply once an AI agent has taken an irreversible action in the world. For those actions the only viable strategy is prevention: explicit approval gates before consequential actions, least-privilege tool access, and conservative action scopes that require human review for any action that cannot be undone. Rollback is not a recovery strategy for autonomous agents. What remains after the fact is a post-hoc audit.
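The prevention strategy can be expressed as a thin wrapper around tool execution: an allow-list for least privilege and a hard stop on anything marked irreversible unless a human has approved it. This is a sketch of the shape only; the `Tool` type, the exception names, and the `approved_by` parameter are illustrative assumptions, not any agent framework's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    reversible: bool  # False for emails, payments, deletes, external side effects


class ToolNotAllowed(Exception):
    """The agent's scope does not include this tool."""


class ApprovalRequired(Exception):
    """An irreversible action was attempted without human sign-off."""


def execute(tool, args, allowed_tools, approved_by=None):
    """Run a tool call only if it is in scope and, when irreversible, approved."""
    if tool.name not in allowed_tools:
        raise ToolNotAllowed(tool.name)
    if not tool.reversible and approved_by is None:
        raise ApprovalRequired(tool.name)
    return tool.run(args)
```

The check sits outside the model's control on purpose: the agent can request an irreversible action, but only the wrapper decides whether it runs.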

Platform Engineering for Agent-Facing Reliability

Force 08 extends platform engineering into new territory. The 2026 platform team is building infrastructure not just for human-facing services, but for agent-facing reliability – the primitives that make AI systems observable, safe, and operable at 3am.

New primitive

MCP server health checks. Agents interact with tools via MCP servers. Each MCP server – filesystem access, database queries, external API calls – is a dependency that can fail or degrade. Health checks for MCP servers belong in the platform monitoring layer alongside service health checks, not in the agent code itself.
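A platform-level check treats each MCP server like any other dependency: probe it, time it, classify it. The sketch below takes the probe as an injected callable, on the assumption that your platform wraps whatever liveness request each server supports. No real MCP client API is used here, and the one-second degradation threshold is arbitrary.

```python
import time


def check_dependency(name, probe, degraded_after_s=1.0):
    """Run one probe and classify the dependency as healthy, degraded, or down."""
    started = time.monotonic()
    try:
        probe()  # hypothetical: performs a liveness round trip, raises on failure
    except Exception as exc:
        return {"name": name, "status": "down", "detail": repr(exc)}
    elapsed = time.monotonic() - started
    status = "degraded" if elapsed > degraded_after_s else "healthy"
    return {"name": name, "status": status, "latency_s": elapsed}


def check_all(probes, degraded_after_s=1.0):
    """probes maps a server name to its probe callable."""
    return [check_dependency(name, probe, degraded_after_s) for name, probe in probes.items()]
```

Keeping this outside the agent code means a slow database tool shows up on the platform dashboard rather than as an unexplained drop in agent task completion.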

New primitive

Agent decision traceability. Every agent decision needs a trace: the context window at decision time, the tool call selected, the result received, and the next decision branch taken. This is not logging – it is structured causal tracing that enables a post-incident engineer to reconstruct the sequence that led to a bad outcome.
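What separates causal tracing from logging is the link from each decision to the one that led to it. A minimal sketch of that structure and of the reconstruction a post-incident engineer would run follows; the record fields are assumptions chosen to mirror the four elements listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionStep:
    step_id: str
    parent_id: Optional[str]  # the decision that led to this one; None for the first
    context_digest: str       # hash of, or pointer to, the context window at decision time
    tool_call: str
    result: str
    next_branch: str


def causal_chain(steps, final_step_id):
    """Walk parent links back from a bad outcome and return the steps in order."""
    by_id = {step.step_id: step for step in steps}
    chain, seen = [], set()
    current = final_step_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in trace at {current}")
        if current not in by_id:
            raise KeyError(f"trace is incomplete: step {current} was never recorded")
        seen.add(current)
        chain.append(by_id[current])
        current = by_id[current].parent_id
    return list(reversed(chain))
```

A missing parent raises instead of returning a shortened chain, because a gap in the chain is exactly the situation the audit-rate SLO exists to prevent.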

New primitive

Input/output guardrails. Guardrails AI and Llama Guard provide the filtering layer between user input and model prompt, and between model output and user display: PII detection, prompt injection detection, topic restriction, and output content validation. These are platform-level controls – not feature-level implementations.
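To make the two filter positions concrete, here is a toy stand-in: one check on the way in, one on the way out, wrapped around any model callable. It does not use Guardrails AI or Llama Guard. The regular expression and marker phrases are crude placeholders that a real deployment would replace with those tools.

```python
import re

# Crude placeholders, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")


def guard_input(user_text):
    """Return sanitised text, or None when the request should be blocked."""
    lowered = user_text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return None
    return EMAIL.sub("[redacted-email]", user_text)


def guard_output(model_text):
    """Redact email addresses before the response reaches the user."""
    return EMAIL.sub("[redacted-email]", model_text)


def guarded_call(user_text, model):
    """Apply the input guard, call the model, then apply the output guard."""
    safe_text = guard_input(user_text)
    if safe_text is None:
        return "Request blocked by input guardrail."
    return guard_output(model(safe_text))
```

Because `guarded_call` takes the model as a parameter, every feature passes through the same two checkpoints, which is what makes this a platform-level control.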

New primitive

AI-aware alerting thresholds. The standard on-call alert fires on latency breach or error rate breach. AI-aware alerting fires on hallucination rate breach, retrieval precision degradation, cost-per-value drift, and golden dataset evaluation failure. These thresholds need to be configured in PagerDuty or equivalent – alongside, not instead of, the traditional SLOs.

Foundation layer

APM & cloud infrastructure monitoring. AI observability sits above the APM layer – not instead of it. Datadog APM, New Relic, or Grafana+Prometheus provide the infrastructure substrate: CPU, memory, network, and service throughput of everything running AI workloads. Cloud provider monitoring (CloudWatch for AWS, Azure Monitor, GCP Cloud Monitoring) covers the managed service health layer. Force 08's AI-specific signals extend above these layers – adding what APM cannot provide. When a hallucination spike coincides with a memory spike on the embedding service, you need both layers to find the cause.

// Force 08 tools · 2026

Langfuse · LangSmith · Arize Phoenix · OpenTelemetry · Datadog LLM Observability · Guardrails AI · Llama Guard · PagerDuty + custom SLOs · Datadog APM · Grafana + Prometheus · CloudWatch / Azure Monitor

all 8 forces

The Cascade: Each Force Is the Ceiling for the Next

These eight forces are not independent. They form a cascade – each force upstream determines the quality ceiling for every force downstream.

F01 Requirements → F02 DDD / Domain → F03 Six Tracks → F04 Polyglot Data → F05 CI/CD → F06 Middleware → F07 Test Suite → F08 SRE

The quality of your requirements analysis determines the quality of your domain model. The quality of your domain model determines the quality of your AI-generated implementation. The quality of your data infrastructure determines the quality of your AI features. The quality of your monitoring determines how long you go blind when something goes wrong. Each force upstream is the ceiling of every force downstream.

This is why the forces series starts with requirements and DDD rather than tools and infrastructure. A team with excellent AI observability tooling and a shallow domain model will observe excellent-quality failures with great clarity. The hard problems stay hard from the top down.

// try it – a 20-minute lab

Your First Golden Dataset in CI

Write ten question–answer pairs your AI feature must get right – real ones, from your domain.
Install Ragas or DeepEval and wrap your RAG or agent endpoint.
Score faithfulness and relevance across the ten pairs; record that as your baseline.
Add it as a CI job that fails when either score drops more than five points (0.05 on a 0–1 scale) below baseline.
Change a prompt and re-run – watch CI catch (or pass) the drift you just introduced.
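The gating step in the lab can be as small as a comparison between the recorded baseline and the current run. In this sketch the scores are assumed to have been produced already by Ragas or DeepEval on a 0–1 scale; how they are computed is left out, and the metric names are simply the two the lab asks for.

```python
TOLERANCE = 0.05  # "five points" on a 0-1 scale


def regressions(baseline, current, tolerance=TOLERANCE):
    """Metrics whose current score fell more than `tolerance` below baseline."""
    return {
        metric: (baseline[metric], current.get(metric, 0.0))
        for metric in baseline
        # A metric missing from the current run is treated as a score of 0.0.
        if current.get(metric, 0.0) < baseline[metric] - tolerance
    }


def ci_exit_code(baseline, current):
    """0 lets the pipeline continue; 1 fails the job."""
    return 1 if regressions(baseline, current) else 0


# Hypothetical scores: faithfulness slipped by 0.02, relevance by 0.07.
baseline_scores = {"faithfulness": 0.90, "relevance": 0.85}
current_scores = {"faithfulness": 0.88, "relevance": 0.78}
failed = regressions(baseline_scores, current_scores)
```

Here only `relevance` is reported, so a CI step that ends with `sys.exit(ci_exit_code(baseline_scores, current_scores))` would fail the job. Treating a missing metric as zero means a silently skipped eval fails the build instead of passing it.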

// what you'll learn: AI evaluation stops being abstract the first time CI blocks a regression you caused on purpose.

The forces series maps the structural changes in how software is built, stored, shipped, and operated when AI is in the delivery chain. The cascade is the point: each force is the ceiling for the next – you can only be as reliable as you can measure.

// carry forward

Chapter 13 is where all eight forces are applied at once – phase by phase across the full delivery lifecycle. Nine phases, eight forces each, honest impact levels, and the failure at every intersection. A reference, not an argument. Keep it open in another tab.

// tool references last reviewed · 2026