The Eval and the Runbook: Forces 07–08

Forces 05 and 06 bring artifacts to production and route AI task payloads through the running system. Forces 07 and 08 start at the point where the monitoring dashboard stops being enough. Force 07 adds the fourth layer of the test pyramid, AI evaluation, which almost no team has built yet. Force 08 asks what SRE looks like when the failure mode is "confidently wrong" rather than "unavailable", and when the alert that should fire does not, because the system is behaving exactly as designed.

// TL;DR — what you'll take away
  • The test pyramid needs a fourth layer — AI evaluation — and almost nobody runs it in CI yet. The tooling exists; the adoption decision is the gap.
  • Your monitoring cannot see "confidently wrong." New SLOs — hallucination rate, retrieval precision floor, agent audit rate, cost-per-value — can.
  • Autonomous agent actions often cannot be rolled back. Prevention is the only strategy; rollback is a post-hoc audit.
// Companion overview
All 8 Forces Reshaping How Software Gets Built — reference card for the full landscape. This article covers Forces 07 and 08.
Force 07 / 08 · Quality & Testing · Critical Gap

The Unified Test Suite: Functional + Non-Functional + AI Eval

The traditional test pyramid has three layers. AI systems need a fourth — AI evaluation — which checks whether your AI feature is producing correct outputs, not just whether it is running. Adoption of this fourth layer in 2026 is low. Ownership is unclear. The failure mode is silent.

The traditional test pyramid was built for a world where software behaved deterministically. Given the same input, the same output emerged. Testing meant specifying inputs, asserting outputs, and confirming the gap between the two was zero. This is still true for the deterministic parts of your system. It is not sufficient for the AI parts.

A language model given the same input will produce semantically similar but textually different outputs across runs. Its response can be factually incorrect, contextually appropriate, confidently delivered, and completely undetected by every existing test in your suite. None of your current quality tooling was designed for this class of failure. Most teams are shipping AI features with a testing gap large enough to affect real users for days before anyone notices.
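To make the gap concrete, here is a minimal sketch in Python. The golden answer, the model outputs, and the `token_overlap` helper are all hypothetical. Word overlap is only a crude stand-in for text-level checks, not the method used by any evaluation library. The point is that exact-match assertions reject a correct paraphrase, while surface similarity can rate a factually wrong answer above a correct one.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: a crude, illustrative proxy."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical golden answer and two hypothetical model outputs.
expected = "the refund window is 30 days from delivery"
paraphrase = "refunds are accepted within 30 days from delivery"  # correct
wrong_fact = "the refund window is 90 days from delivery"         # incorrect

# An exact-match assertion rejects the correct paraphrase...
exact_match_passes = paraphrase == expected

# ...while surface similarity scores the wrong answer above the correct one.
overlap_paraphrase = token_overlap(expected, paraphrase)
overlap_wrong = token_overlap(expected, wrong_fact)

print(exact_match_passes, round(overlap_paraphrase, 2), round(overlap_wrong, 2))
# prints: False 0.33 0.78
```

Neither check separates the right answer from the wrong one, which is why the evaluation layer described below scores meaning and grounding rather than text.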

The Four-Layer Test Suite

  • Layer 1 · Functional Tests: Unit · Integration · Contract (tools: Vitest · Playwright · Pact · Jest)
  • Layer 2 · Non-Functional Tests: Performance · Security · Load (tools: k6 · Locust · Snyk · OWASP ZAP)
  • Layer 3 · Data Quality Tests: Schema · Freshness · Pipeline (tools: Great Expectations · dbt tests · Monte Carlo)
  • Layer 4 · AI Evaluation: Faithfulness · Relevance · Hallucination Rate (tools: Ragas · DeepEval · LangSmith)
Missing: Layers 3 and 4 are the layers most teams are missing in 2026.

Layers 1 and 2 are widely adopted and well-understood. Layer 3 — data quality testing — is implemented by teams with mature data engineering practices and largely absent everywhere else. Layer 4 — AI evaluation — is understood conceptually, has good tooling, and has almost no production adoption. Most teams that ship AI features have never run a single automated AI evaluation test in CI.

What Nothing Is Telling You

  • Load balancer: knows if your API response time is too slow.
  • Error tracker: knows if your API is throwing exceptions or returning 5xx errors.
  • Uptime monitor: knows if your endpoints are reachable and responding.
  • Log aggregator: knows if unusual patterns appear in your structured output logs.
  • Any of the above: does not know if your AI feature is producing confidently incorrect answers at 12% of queries.
  • Any of the above: does not know if your RAG retrieval precision has degraded because the embedding index drifted.
  • Any of the above: does not know if the context your AI is being given is three days stale and the answers reflect it.

The last three rows are where your users are living right now if you have shipped an AI feature without Layer 4. The system is healthy by every metric your monitoring tracks. Users are receiving incorrect, outdated, or hallucinated responses. The symptom — bad AI output — and the cause — a data pipeline failure, a stale embedding index, a retrieval precision threshold that was never defined — are invisible to each other.

↳ see also · Article 06 — AI Adoption Is Not a Tooling Problem — healthy dashboards, wrong answers: this blind spot as a process failure.
// The critical gap
AI features are downstream of data pipelines. Data quality issues surface as AI quality issues — silently, in production — affecting real users for days before anyone connects the symptom to the cause. The data quality test suite is not a data engineering concern. It is an AI correctness concern. These two things belong in the same CI pipeline, owned by the same team, gating the same deployment.
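A minimal sketch of what "the same CI pipeline, gating the same deployment" could look like. Every name here (`deployment_gate`, `GateResult`, the thresholds, the scores) is a hypothetical illustration. In practice the eval scores would come from your evaluation job and the index timestamp from your data pipeline metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def deployment_gate(
    index_built_at: datetime,
    now: datetime,
    max_index_age: timedelta,
    eval_scores: dict[str, float],
    eval_floors: dict[str, float],
) -> GateResult:
    """Fail the deployment if the data is stale OR any eval metric is below its floor."""
    reasons = []

    # Data quality check: freshness of the embedding index.
    age_hours = (now - index_built_at).total_seconds() / 3600
    if now - index_built_at > max_index_age:
        reasons.append(f"embedding index is stale ({age_hours:.0f}h old)")

    # AI evaluation check: every floor must have a score at or above it.
    for metric, floor in eval_floors.items():
        score = eval_scores.get(metric)
        if score is None:
            reasons.append(f"missing eval score: {metric}")
        elif score < floor:
            reasons.append(f"{metric} {score:.2f} is below floor {floor:.2f}")

    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical run: eval scores look fine, but the index is three days old.
now = datetime(2026, 6, 4, tzinfo=timezone.utc)
result = deployment_gate(
    index_built_at=now - timedelta(days=3),
    now=now,
    max_index_age=timedelta(hours=24),
    eval_scores={"faithfulness": 0.93, "context_precision": 0.85},
    eval_floors={"faithfulness": 0.90, "context_precision": 0.80},
)
print(result.passed, result.reasons)
```

In this run the gate fails on the data check alone, even though every AI metric clears its floor. That is the case a separate, data-team-only pipeline would not surface to the team shipping the AI feature.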

The AI Evaluation Layer in Practice

The primitives for AI evaluation exist and are production-ready. Ragas and DeepEval provide the core metrics. LangSmith provides the tracking infrastructure and golden dataset management. The tooling is not the gap — the adoption decision is.

Metric (tool): what it measures
  • Faithfulness (Ragas): Does the response stay within the retrieved context? Or is the model adding content not present in the source?
  • Relevance (Ragas · DeepEval): Is the response actually answering the question asked? High relevance means the answer addresses the query; low relevance means it responds to a different question.
  • Context Precision (Ragas): Of the chunks retrieved for RAG, what fraction were actually relevant to the query? Low precision means noise is contaminating the context.
  • Hallucination Rate (DeepEval · LangSmith): What percentage of responses contain statements that cannot be grounded in the context or source data? Tracked over time to detect model or data drift.
  • Context Recall (Ragas): Was the relevant information in the corpus actually retrieved? Low recall means correct answers exist in your data but the retrieval system is not finding them.
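The definitions above reduce to simple ratios once relevance and grounding labels exist. The sketch below computes three of them by hand, assuming hypothetical human-labelled data. It illustrates the definitions as worded here, not how Ragas or DeepEval implement their metrics; those libraries compute their own variants, typically with a model acting as judge.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the query."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks in the corpus that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def hallucination_rate(response_is_grounded: list[bool]) -> float:
    """Fraction of responses containing at least one ungrounded statement."""
    if not response_is_grounded:
        return 0.0
    ungrounded = sum(1 for grounded in response_is_grounded if not grounded)
    return ungrounded / len(response_is_grounded)


# Hypothetical labels for one query: chunk ids retrieved vs. chunk ids a
# reviewer marked as relevant.
retrieved_chunks = ["c1", "c2", "c3", "c4"]
relevant_chunks = {"c1", "c3", "c7"}

precision = context_precision(retrieved_chunks, relevant_chunks)  # 2 of 4 retrieved
recall = context_recall(retrieved_chunks, relevant_chunks)        # 2 of 3 relevant

# Hypothetical grounding verdicts for five sampled responses.
rate = hallucination_rate([True, True, False, True, True])        # 1 of 5 ungrounded

print(precision, round(recall, 2), rate)
```

Low precision with high recall points at noisy retrieval, while high precision with low recall points at missing coverage. Tracking both separately is what makes the retrieval layer debuggable.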
// The golden dataset problem
AI evaluation requires curated golden datasets: representative input–output pairs that define correct behaviour for your domain. These are not one-time artefacts — they require active maintenance. When the underlying model is updated, when the data the RAG system retrieves from changes, when the domain evolves, the golden dataset must be reviewed and updated to remain meaningful. Ownership of the golden dataset — which team maintains it, who defines "correct" for edge cases, how it is version-controlled — is an organisational question that must be answered before the technical tooling can do its job. Teams that skip this answer learn it the hard way: their AI eval CI job passes every PR because the golden dataset is two model versions out of date.
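One way to keep a stale golden dataset from silently passing every PR is to version a small manifest alongside it and fail CI when the manifest no longer matches what is deployed. The sketch below is an assumption about how that could look: the `GoldenDatasetManifest` fields, the 90-day review window, and the model names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenDatasetManifest:
    dataset_version: str
    owner: str                   # team accountable for defining "correct"
    reviewed_against_model: str  # model version the pairs were last reviewed with
    reviewed_on: date


def golden_dataset_problems(
    manifest: GoldenDatasetManifest,
    deployed_model: str,
    today: date,
    max_review_age_days: int = 90,  # hypothetical review window
) -> list[str]:
    """Return reasons the golden dataset should not be trusted as a CI gate."""
    problems = []
    if not manifest.owner.strip():
        problems.append("no owning team recorded")
    if manifest.reviewed_against_model != deployed_model:
        problems.append(
            f"reviewed against {manifest.reviewed_against_model}, "
            f"but {deployed_model} is deployed"
        )
    review_age = (today - manifest.reviewed_on).days
    if review_age > max_review_age_days:
        problems.append(f"last reviewed {review_age} days ago")
    return problems


# Hypothetical manifest that is two model versions behind.
manifest = GoldenDatasetManifest(
    dataset_version="2026.01",
    owner="search-quality",
    reviewed_against_model="model-v1",
    reviewed_on=date(2026, 1, 10),
)
problems = golden_dataset_problems(manifest, deployed_model="model-v3", today=date(2026, 6, 1))
print(problems)
```

Running this check before the eval job turns "the dataset is out of date" from something discovered in an incident review into an explicit CI failure with a named owner.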
// Force 07 tools · 2026
Ragas · DeepEval · LangSmith · Great Expectations · dbt tests · Vitest / Playwright / Pact · k6 / Locust · Snyk / OWASP ZAP
Force 08 / 08 · Reliability & Platform Engineering · Frontier

AI-Aware SRE: When Runbooks Meet Hallucination

SRE runbooks were designed for deterministic systems. AI systems are not deterministic. The on-call engineer in 2026 needs a new category of tool, a new category of SLO, and a new category of playbook — for failures that don't fire alerts.

Site Reliability Engineering was built on a foundational assumption: a healthy system is one where latency is within bounds, error rate is below threshold, and throughput is within capacity. Define those bounds, alert on breach, escalate, fix. The runbook captures what to do when any of those three numbers goes wrong.

AI systems introduce a failure mode that none of those three numbers captures: the system is fast, reliable, and operating at normal throughput — and it is giving users wrong answers. There is no alert for "confident incorrectness." There is no SLO for "degraded reasoning quality." The runbook has no entry for "the model started hallucinating entity names in user-visible summaries three days ago and nobody knew."

// The 3–6 month pattern
This failure hits every team shipping AI features — typically 3 to 6 months after launch. In the first weeks, the team monitors closely, quality is high, edge cases are rare. As time passes, novelty wears off, monitoring relaxes, and edge cases accumulate in production. The model encounters inputs it was not well-evaluated against. The embedding index has drifted from the live data. A configuration change altered retrieval behaviour without a corresponding eval run. By the time a user complaint surfaces, the degradation has been ongoing for weeks. The incident review cannot identify the start date because nothing in the monitoring recorded it.
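If a quality score is sampled daily and stored, the start date of a degradation becomes a lookup rather than a guess. The sketch below assumes a hypothetical series of daily faithfulness scores and a hypothetical floor of 0.90. It returns the first day of the first sustained run below the floor, ignoring one-off dips. "Consecutive" here means consecutive samples, not calendar days.

```python
from datetime import date
from typing import Optional


def degradation_start(
    daily_scores: dict[date, float],
    floor: float,
    min_run: int = 3,
) -> Optional[date]:
    """First day of the earliest run of `min_run` consecutive samples below `floor`."""
    run_start = None
    run_length = 0
    for day in sorted(daily_scores):
        if daily_scores[day] < floor:
            if run_length == 0:
                run_start = day
            run_length += 1
            if run_length >= min_run:
                return run_start
        else:
            # A recovery resets the run, so isolated dips are ignored.
            run_start = None
            run_length = 0
    return None


# Hypothetical daily scores from sampled production traffic.
scores = {
    date(2026, 3, 1): 0.95,
    date(2026, 3, 2): 0.94,
    date(2026, 3, 3): 0.88,  # isolated dip
    date(2026, 3, 4): 0.95,
    date(2026, 3, 5): 0.89,  # sustained degradation begins
    date(2026, 3, 6): 0.87,
    date(2026, 3, 7): 0.85,
    date(2026, 3, 8): 0.84,
}
started = degradation_start(scores, floor=0.90)
print(started)
# prints: 2026-03-05
```

The same function can drive an alert: once it returns a date, the degradation is already `min_run` samples old rather than weeks old.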

New SLO Vocabulary for AI Systems

The answer is not more dashboards. It is defining what "healthy" means for an AI system — the equivalent of latency and error rate SLOs, applied to AI behaviour. These SLOs are not standard; every organisation must define them from the characteristics of their domain. But the categories are consistent:

// SLO type 01
Hallucination Rate SLO
The maximum acceptable percentage of responses that contain claims not grounded in the retrieved context or authoritative source data. Typically set per feature tier: higher tolerance for internal tools, lower for user-facing answers.
// Measured by: Ragas faithfulness + DeepEval hallucination scorer on production samples
// SLO type 02
Retrieval Precision Floor
The minimum acceptable ratio of relevant-to-retrieved chunks in your RAG pipeline. Below this floor, the AI is generating from noisy context and output quality will degrade regardless of model capability. Alert before it affects users.
// Measured by: Ragas context precision on a rolling sample of production queries
// SLO type 03
Agent Action Audit Rate
The percentage of autonomous agent actions that are logged with full traceability: task instruction, context window state at decision time, tool call made, result received. 100% is the only acceptable audit rate: partial traceability means only a partial ability to reconstruct, after the fact, what the agent did and why.
// Measured by: Langfuse / LangSmith trace completeness per agent session
// SLO type 04
Cost-Per-Delivered-Value
The LLM inference cost attributable to each unit of delivered value — a completed task, a resolved query, a summarised document. Rising cost-per-value (not total cost) is the early warning signal of degrading retrieval efficiency or model misuse patterns before they become incidents.
// Measured by: token cost per completed task, tracked by feature and model version
// SLO type 05
Business Outcome SLO
Did the AI feature achieve its business purpose? Task completion rate (did the agent successfully resolve the request?), user satisfaction on AI-assisted interactions, and AI-assisted resolution rate. These are distinct from Cost-Per-Delivered-Value — they measure whether value was actually delivered, not what it cost to deliver it. A low-cost workflow that produces wrong answers fails this SLO despite passing the cost SLO.
// Measured by: product analytics + user feedback signals + agent task outcome logs (Amplitude, Mixpanel, custom event telemetry)
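The five SLO types can be expressed as plain thresholds and checked against observed values. The sketch below is one possible shape, not a standard: the `AiSlo` class, the metric names, and every threshold are hypothetical placeholders that each organisation would replace with values from its own domain. A missing measurement is treated as a breach, on the assumption that an unmeasured SLO is itself a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AiSlo:
    name: str
    threshold: float
    higher_is_better: bool

    def breached(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed < self.threshold
        return observed > self.threshold


# Hypothetical thresholds, one per SLO type described above.
SLOS = [
    AiSlo("hallucination_rate", 0.02, higher_is_better=False),
    AiSlo("retrieval_precision", 0.80, higher_is_better=True),
    AiSlo("agent_action_audit_rate", 1.00, higher_is_better=True),
    AiSlo("cost_per_completed_task_usd", 0.15, higher_is_better=False),
    AiSlo("task_completion_rate", 0.90, higher_is_better=True),
]


def cost_per_delivered_value(total_token_cost_usd: float, completed_tasks: int) -> float:
    """Inference cost per completed task; infinite if nothing was delivered."""
    if completed_tasks == 0:
        return float("inf")
    return total_token_cost_usd / completed_tasks


def breached_slos(observed: dict[str, float]) -> list[str]:
    """Names of SLOs that are breached or have no measurement at all."""
    breaches = []
    for slo in SLOS:
        value = observed.get(slo.name)
        if value is None or slo.breached(value):
            breaches.append(slo.name)
    return breaches


# Hypothetical daily observation: 42 USD of tokens for 240 completed tasks.
observed = {
    "hallucination_rate": 0.015,
    "retrieval_precision": 0.84,
    "agent_action_audit_rate": 0.998,
    "cost_per_completed_task_usd": cost_per_delivered_value(42.0, 240),
    "task_completion_rate": 0.93,
}
breaches = breached_slos(observed)
print(breaches)
# prints: ['agent_action_audit_rate', 'cost_per_completed_task_usd']
```

Note that the audit rate breaches at 99.8%. That follows from treating 100% as the only acceptable value, so the threshold leaves no error budget at all.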
// The on-call reality
On-call for an AI system in 2026 requires different debugging skills than traditional SRE. Latency profiles are normal. Error rates are normal. The investigation starts with: what changed in the data pipeline? When was the embedding index last rebuilt? Did the golden dataset evaluation pass on the last deployment? None of these questions are in a traditional runbook. Building the runbook for AI failure modes is the work of Force 08 — and most teams have not started it.

Autonomous Action Rollback: The Hard Constraint

// The one-way door problem
Traditional SRE assumes rollback is possible: revert the deployment, restore the database snapshot, re-route traffic to the previous version. Autonomous agent actions often cannot be rolled back. An email that has been sent stays sent. An API call to an external service has already completed. A payment has been processed. A record has been deleted. The SRE principle of reversibility, which underpins almost all traditional incident response, does not apply once an AI agent has taken an irreversible action in the world. The only viable strategy for these failures is therefore prevention: explicit approval gates before consequential actions, least-privilege tool access, and conservative action scopes that require human review for any action that cannot be undone. Rollback is not a recovery strategy for autonomous agents. It is a post-hoc audit.
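A minimal sketch of the prevention strategy: an allow-list for least-privilege tool access, an approval gate in front of irreversible tools, and an audit entry for every attempt. The `AgentToolbox` class, the tool names, and the `approved_by` parameter are hypothetical. A real system would route approvals through its own review workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without human sign-off."""


class ToolNotAllowed(Exception):
    """Raised when the agent requests a tool outside its granted scope."""


@dataclass(frozen=True)
class Tool:
    name: str
    reversible: bool
    run: Callable[[dict], str]


@dataclass
class AgentToolbox:
    allowed: dict[str, Tool]  # least privilege: explicit allow-list
    audit_log: list[dict] = field(default_factory=list)

    def call(self, tool_name: str, args: dict, approved_by: Optional[str] = None) -> str:
        entry = {"tool": tool_name, "args": args, "approved_by": approved_by}
        tool = self.allowed.get(tool_name)
        if tool is None:
            self.audit_log.append({**entry, "outcome": "denied"})
            raise ToolNotAllowed(tool_name)
        # Approval gate: irreversible actions never run on the agent's say-so.
        if not tool.reversible and approved_by is None:
            self.audit_log.append({**entry, "outcome": "awaiting_approval"})
            raise ApprovalRequired(tool_name)
        result = tool.run(args)
        self.audit_log.append({**entry, "outcome": "executed", "result": result})
        return result


# Hypothetical tools: drafting is reversible, sending is not.
toolbox = AgentToolbox(allowed={
    "draft_email": Tool("draft_email", True, lambda a: f"draft for {a['to']}"),
    "send_email": Tool("send_email", False, lambda a: f"sent to {a['to']}"),
})

print(toolbox.call("draft_email", {"to": "a@example.com"}))
print(toolbox.call("send_email", {"to": "a@example.com"}, approved_by="on-call reviewer"))
```

Blocked and denied attempts are logged as well as executed ones, so the audit trail records what the agent tried to do, not only what it was allowed to do.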

Platform Engineering for Agent-Facing Reliability

Force 08 extends platform engineering into new territory. The 2026 platform team is building infrastructure not just for human-facing services, but for agent-facing reliability — the primitives that make AI systems observable, safe, and operable at 3am.

New primitive
MCP server health checks. Agents interact with tools via MCP servers. Each MCP server — filesystem access, database queries, external API calls — is a dependency that can fail or degrade. Health checks for MCP servers belong in the platform monitoring layer alongside service health checks, not in the agent code itself.
New primitive
Agent decision traceability. Every agent decision needs a trace: the context window at decision time, the tool call selected, the result received, and the next decision branch taken. This is not logging — it is structured causal tracing that enables a post-incident engineer to reconstruct the sequence that led to a bad outcome.
New primitive
Input/output guardrails. Guardrails AI and Llama Guard provide the filtering layer between user input and model prompt, and between model output and user display. That layer covers PII detection, prompt injection detection, topic restriction, and output content validation. These are platform-level controls, not feature-level implementations.
New primitive
AI-aware alerting thresholds. The standard on-call alert fires on latency breach or error rate breach. AI-aware alerting fires on hallucination rate breach, retrieval precision degradation, cost-per-value drift, and golden dataset evaluation failure. These thresholds need to be configured in PagerDuty or equivalent — alongside, not instead of, the traditional SLOs.
Foundation layer
APM & cloud infrastructure monitoring. AI observability sits above the APM layer — not instead of it. Datadog APM, New Relic, or Grafana+Prometheus provide the infrastructure substrate: CPU, memory, network, and service throughput of everything running AI workloads. Cloud provider monitoring (CloudWatch for AWS, Azure Monitor, GCP Cloud Monitoring) covers the managed service health layer. Force 08's AI-specific signals extend above these layers — adding what APM cannot provide. When a hallucination spike coincides with a memory spike on the embedding service, you need both layers to find the cause.
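Of these primitives, decision traceability is the easiest to sketch as a data structure. The example below assumes a hypothetical `DecisionTrace` record carrying the four elements named above, plus a completeness check that yields an audit rate per batch of traces. Field names are illustrative and not taken from any tracing product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionTrace:
    session_id: str
    step: int
    context_snapshot_ref: Optional[str]  # pointer to the context window at decision time
    tool_call: Optional[str]             # tool selected, with arguments
    tool_result: Optional[str]           # result received
    next_branch: Optional[str]           # decision branch taken next


REQUIRED_FIELDS = ("context_snapshot_ref", "tool_call", "tool_result", "next_branch")


def missing_fields(trace: DecisionTrace) -> list[str]:
    """Required fields that are absent or empty on this trace."""
    return [name for name in REQUIRED_FIELDS if not getattr(trace, name)]


def audit_rate(traces: list[DecisionTrace]) -> float:
    """Fraction of traces that can be fully reconstructed after an incident."""
    if not traces:
        return 1.0  # no actions were taken, so none are unaudited
    complete = sum(1 for trace in traces if not missing_fields(trace))
    return complete / len(traces)


# Hypothetical session: the second step lost its tool result.
traces = [
    DecisionTrace("s1", 1, "ctx/s1/1", "search(q='refund policy')", "3 chunks", "answer"),
    DecisionTrace("s1", 2, "ctx/s1/2", "send_email(to='a@example.com')", None, "done"),
    DecisionTrace("s1", 3, "ctx/s1/3", "close_ticket(id=42)", "ok", "end"),
]
rate = audit_rate(traces)
gaps = missing_fields(traces[1])
print(round(rate, 2), gaps)
# prints: 0.67 ['tool_result']
```

A per-session audit rate computed this way can feed an alerting threshold directly, alongside the traditional latency and error-rate signals.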
// Force 08 tools · 2026
Langfuse · LangSmith · Arize Phoenix · OpenTelemetry · Datadog LLM Observability · Guardrails AI · Llama Guard · PagerDuty + custom SLOs · Datadog APM · Grafana + Prometheus · CloudWatch / Azure Monitor
all 8 forces

The Cascade: Each Force Is the Ceiling for the Next

These eight forces are not independent. They form a cascade — each force upstream determines the quality ceiling for every force downstream.

F01 Requirements → F02 DDD / Domain → F03 Six Tracks → F04 Polyglot Data → F05 CI/CD → F06 Middleware → F07 Test Suite → F08 SRE
The quality of your requirements analysis determines the quality of your domain model. The quality of your domain model determines the quality of your AI-generated implementation. The quality of your data infrastructure determines the quality of your AI features. The quality of your monitoring determines how long you go blind when something goes wrong. Each force upstream is the ceiling of every force downstream.

This is why the forces series starts with requirements and DDD rather than tools and infrastructure. A team with excellent AI observability tooling and a shallow domain model will only observe its own failures with great clarity. The hard problems stay hard from the top down.

The forces series maps the structural changes in how software is built, stored, shipped, and operated when AI is in the delivery chain. Article 11 is where all eight forces are applied simultaneously — phase by phase across the full software delivery lifecycle. Nine phases. Eight forces at each phase. Honest impact levels. The failures that happen at each intersection, and the tools that address them. It is a reference, not an argument. Keep it open in another tab.
// tool references last reviewed · June 2026