Inside the Language Model – From Prompt to Answer

Chapter 4 of 18 Primer · 9 min

In Chapter 3, the search engine you already knew turned out to be the ancestor of nearly everything in modern AI. The vocabulary changed in 2017; the instincts did not. This is the other half: the machine itself. A prompt goes in as text, an answer comes out, and in between sit four layers and a lifecycle you can watch run, stage by stage, right here.

// the crux

A language model is text in, text out – with four layers (LLM, RAG, MCP, agents) and a lifecycle in between. Strip the mystique and it is arithmetic you can watch run, stage by stage. The one genuinely new move is generation; the rest you already met in search.

// in one breath

The four layers almost every AI deployment in 2026 sits inside – LLM, RAG, MCP, agents – and the one thing each one adds.
The AI lifecycle from prompt to answer: the same shape as search, with one genuinely new move – and every algorithm in it playable inline, from tokenisation to generation.
Where the machine stops being magic and turns into arithmetic – and what running it against real systems then demands, in Chapter 5.

You have already met the ancestors. Tokenisation grew out of query tokenisation, vector retrieval out of the inverted index, attention out of relevance ranking. What follows is what the 2017 architecture built on top of them – and the one capability search never had.

the stack

The Four Layers Built on 2017

The transformer paper opened a capability gap. The industry spent the next six years filling it. The result is a four-layer stack that almost every AI deployment in 2026 sits inside. Each layer adds exactly one thing. Understand what each adds, and you understand the architecture. Treat any layer as a black box, and you cannot debug it when it fails – and it will fail.

LLM – Large Language Model

Adds: text → meaning → text

The reasoning engine. It takes text in and produces text out: it interprets context, follows instructions, and generates content. Claude, GPT-4o, Gemini 2.5, Llama 3 – all LLMs. This is the core capability from the 2017 architecture shift. Alone, it is powerful and stateless: it knows only what it learned in training and what you put in the prompt.

RAG – Retrieval-Augmented Generation

Adds: static knowledge → live retrieval

The memory layer. A standalone LLM only knows what it was trained on. RAG adds real-time retrieval: when a prompt arrives, relevant documents are fetched from a knowledge base – your codebase, your docs, your database – and injected into context. The model responds with current, specific information instead of generalised training data.

MCP – Model Context Protocol

Adds: isolated → tool-connected

The integration layer. Connects the LLM to external tools and systems – databases, APIs, file systems, version control, communication platforms. Without MCP (or an equivalent tool-calling layer), the model can only generate text about the world. With it, the model can act on the world. That distinction is the line between a chatbot and an agent.

Agents – Autonomous Task Execution

Adds: single-call → multi-step orchestration

The execution layer. Instead of a single prompt-response cycle, agents operate across multiple steps – planning a task, calling tools, evaluating intermediate outputs, correcting course, and continuing until a defined goal is met. An agent is not a smarter chatbot. It is a controlled task-execution loop. In 2026 the tools are Claude Code, GitHub Copilot Workspace, Cursor, Amazon Q Developer, and Gemini Code Assist. The underlying model barely matters – the orchestration does.

the lifecycle

That four-layer stack is a static picture. Here it is in motion – the same shape as the search pipeline, from your prompt to the answer, with each component doing its one job. Where a stage is an algorithm you can run, open its try it live panel and step through it yourself.

Context → Tokenize → Embed → Retrieve · RAG → Attend → Generate → Feedback Loop

PHASE A Comprehend // prompt → tokens → vectors → grounded context

Assemble the context window

the model's whole world

Your message is never seen alone. It is stitched together with the system instructions and the chat history into one context window – the entire, finite input the model gets. Nothing is looked up in a database of stored answers.

PART 1

System prompt

The standing rules: who the model is and what it may do.

set by · the app

PART 2

Chat history

Everything said so far in this conversation, in order.

scope · the session

PART 3

Your message

The new prompt, the thing you just typed.

scope · this turn

your message » “explain attention simply”
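The assembly step can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `Message` type, the `build_context` function, the word-count stand-in for tokens, and the drop-oldest-history policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def rough_token_count(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def build_context(system_prompt: str, history: list[Message],
                  user_message: str, budget: int) -> list[Message]:
    """Stitch system prompt, chat history, and the new message into one window.

    The system prompt and the new message are always kept; the oldest
    history turns are dropped first when the budget would be exceeded.
    """
    system = Message("system", system_prompt)
    latest = Message("user", user_message)
    used = rough_token_count(system.content) + rough_token_count(latest.content)

    kept: list[Message] = []
    # Walk history from newest to oldest so recent turns survive.
    for msg in reversed(history):
        cost = rough_token_count(msg.content)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost

    return [system, *reversed(kept), latest]


history = [
    Message("user", "what is a transformer"),
    Message("assistant", "a neural network architecture built on attention"),
]
window = build_context("You are a concise tutor.", history,
                       "explain attention simply", budget=20)
for m in window:
    print(f"{m.role}: {m.content}")
```

With `budget=20` all four messages fit; lowering it to 12 drops both history turns, which is the "finite input" constraint in miniature.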

Tokenize

engine: BPE tokenizer

A model never sees letters. The text is split into tokens – subword pieces from a fixed vocabulary. Common words stay whole; rarer ones split into reusable pieces, and each token becomes an integer ID.

tokens: explain · attention · simply → ids 25, 6817, 9760, 88
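To make the splitting concrete, here is a toy BPE-style encoder that applies an already-learned merge table. The merge list, the vocabulary, and the resulting IDs are hypothetical and tiny; a production tokenizer learns tens of thousands of merges from a large corpus and usually works on bytes rather than characters.

```python
# Hypothetical merge table, in the order the merges were "learned".
MERGES = [("l", "y"), ("s", "i"), ("si", "m"), ("sim", "p")]

# Hypothetical vocabulary: each known piece maps to an integer ID.
VOCAB = {piece: idx for idx, piece in enumerate(
    ["<unk>", "simp", "ly", "s", "i", "m", "p", "l", "y"])}


def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Start from single characters and apply each merge in learned order."""
    pieces = list(word)
    for left, right in merges:
        merged: list[str] = []
        i = 0
        while i < len(pieces):
            # Fuse an adjacent (left, right) pair into one piece.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


def encode(text: str) -> list[int]:
    """Split on whitespace, break each word into pieces, map pieces to IDs."""
    ids: list[int] = []
    for word in text.lower().split():
        for piece in apply_merges(word, MERGES):
            ids.append(VOCAB.get(piece, VOCAB["<unk>"]))
    return ids


print(apply_merges("simply", MERGES))
print(encode("simply sly"))
```

The word "simply" ends up as two reusable pieces, and "sly" reuses the same "ly" piece, which is the point of a subword vocabulary: a fixed set of pieces can cover words it never saw whole.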

▶ Try it live: tokenisation · text becomes numbers

Embed

engine: embedding table

Each token ID is looked up and becomes a vector – a list of numbers – so meaning turns into geometry: similar ideas land near each other (king near queen). It is the same kind of vector space that powers semantic search.

each token → a vector → [0.12, -0.41, …] [0.90, 0.08, …] [-0.33, 0.71, …]
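"Near each other" has a precise meaning: the angle between two vectors is small, which cosine similarity measures. The sketch below uses a hypothetical three-dimensional embedding table with hand-picked values; real tables are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical embeddings, chosen by hand for illustration only.
EMBEDDINGS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word: str) -> str:
    """Return the other word whose vector points in the most similar direction."""
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


print(nearest("king"))
print(round(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]), 2))
```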

▶ Try it live: embeddings · meaning becomes geometry

Retrieve context · RAG (if grounded)

engine: vector search

The prompt is turned into a single query vector (typically by a separate embedding model) and matched against a vector database to pull the most relevant documents – your docs, your code, your knowledge base – and inject them into the context. This is search's retrieval step, reborn inside the model's input.

GROUNDED Documents found

The nearest passages are injected into the context window, so the answer is anchored to real sources, not just memory.

PARAMETRIC No retrieval

With no knowledge base attached, the model answers from its trained parameters alone – fast, but ungrounded.

injected: transformer-paper.md · attention-explained.md
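The retrieve-then-inject step can be sketched as a nearest-vector lookup followed by string assembly. Everything numeric here is an assumption: the document vectors and the query vector are hand-picked, `release-notes.md` is a hypothetical distractor, and the exhaustive scan stands in for the approximate nearest-neighbour index a real vector database would use.

```python
import math

# Hypothetical document vectors; a real system would compute these
# with an embedding model and store them in a vector database.
DOCS = {
    "transformer-paper.md":   [0.9, 0.1, 0.2],
    "attention-explained.md": [0.8, 0.3, 0.1],
    "release-notes.md":       [0.1, 0.9, 0.3],
}


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Rank every document by similarity to the query and keep the top k."""
    ranked = sorted(DOCS, key=lambda name: cosine(query_vec, DOCS[name]), reverse=True)
    return ranked[:k]


def inject(prompt: str, sources: list[str]) -> str:
    """Prepend the retrieved sources so the model sees them as context."""
    context = "\n".join(f"[source] {name}" for name in sources)
    return f"{context}\n\n{prompt}"


query_vec = [0.85, 0.2, 0.15]  # stand-in for the embedded prompt
grounded_prompt = inject("explain attention simply", retrieve(query_vec))
print(grounded_prompt)
```

In the parametric case the `retrieve` call is simply skipped and the prompt goes to the model unchanged.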

CONTEXT + MEANING READY ↓ feeds the transformer

PHASE B Generate // attention → next-token loop → tools → answer

Attend

engine: transformer · self-attention

A word means nothing alone. The model weighs every token against every other, all at once, to work out what refers to what and what matters. This is the 2017 transformer doing its work – relevance scoring, the search engineer's craft, turned inward on the sentence.

weights: attention → looks most at explain and the retrieved transformer-paper
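"Weighs every token against every other" is scaled dot-product attention. The sketch below is deliberately simplified: it uses the token vectors directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices and runs many attention heads in parallel. The input vectors are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Return (attention weights, mixed outputs) for a list of token vectors.

    Simplification: queries, keys, and values are all the raw input vectors.
    """
    dim = len(x[0])
    weights = []
    for query in x:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return weights, outputs


# Three hypothetical token vectors; the first and third point the same way.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token attends to every other; the first token gives equal weight to itself and the third, and less to the second.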

▶ Try it live: attention · context, computed

Generate, token by token

★ the one new step · engine: autoregressive loop

Now the model writes. It produces a probability distribution over every token in its vocabulary, samples one, appends it, and runs the whole thing again – building the answer one piece at a time. A very large, very capable autocomplete. This is the stage search never had: a search engine ranks pages that already exist; this loop generates text that did not.

streaming: “Attention lets each word look at the others …”
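The loop itself is short. The sketch below swaps the transformer for a hypothetical lookup table of next-token scores keyed on the previous token only, which is a large simplification (a real model conditions on the entire context), but the predict, pick, append, repeat cycle is the same shape.

```python
import math
import random

# Hypothetical next-token scores (logits); a real model computes these
# from the whole context with a neural network.
LOGITS = {
    "<start>":   {"attention": 2.0, "the": 1.0},
    "attention": {"lets": 2.5, "is": 1.0},
    "lets":      {"each": 2.0, "words": 0.5},
    "each":      {"word": 3.0},
    "word":      {"look": 2.0, "<end>": 0.5},
    "look":      {"<end>": 2.0},
    "the":       {"<end>": 1.0},
    "is":        {"<end>": 1.0},
    "words":     {"<end>": 1.0},
}


def softmax(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert scores to probabilities; lower temperature sharpens them."""
    peak = max(logits.values())
    exps = {tok: math.exp((s - peak) / temperature) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(max_tokens: int = 10, temperature: float = 1.0,
             greedy: bool = False, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = softmax(LOGITS[tokens[-1]], temperature)
        if greedy:
            next_token = max(probs, key=probs.get)  # always the likeliest
        else:
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens[1:]


print(" ".join(generate(greedy=True)))
```

Greedy decoding always takes the top token and is deterministic; sampling with a temperature trades some of that predictability for variety.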

▶ Try it live: next-token generation · the one new move

Act with tools · MCP (if needed)

protocol: MCP

When the task needs more than text, the model calls tools – query a database, hit an API, edit a file. This is the line between a chatbot and an agent.

NEEDS TOOL Call out

The model emits a structured tool call; the host application runs it and feeds the result back into the context, then the model keeps generating.

TEXT ONLY Answer directly

If the answer is just language, no tool is called – generation continues straight to the reply.

tool call: search_docs("self-attention") → 3 results
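The branch between "call out" and "answer directly" can be sketched as a small dispatcher on the host side. This is an illustration of the pattern only, not the MCP wire format: the JSON shape with `tool` and `arguments` keys, the `handle_model_output` function, and the body of `search_docs` (a substring match over a made-up corpus) are all assumptions.

```python
import json


def search_docs(query: str) -> list[str]:
    # Hypothetical tool: substring search over a stand-in corpus.
    corpus = [
        "Self-attention in the transformer",
        "Multi-head self-attention",
        "Why self-attention scales quadratically",
        "Tokenisation basics",
    ]
    return [doc for doc in corpus if query.lower() in doc.lower()]


# Registry of tools the host is willing to run on the model's behalf.
TOOLS = {"search_docs": search_docs}


def handle_model_output(output: str, context: list[str]) -> list[str]:
    """Run a tool if the output is a known tool call; otherwise treat it as text."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return context + [f"assistant: {output}"]

    name = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(name, str) or name not in TOOLS:
        return context + [f"assistant: {output}"]

    # The host, not the model, executes the call and records the result.
    result = TOOLS[name](**call.get("arguments", {}))
    return context + [f"tool_result({name}): {json.dumps(result)}"]


context = ["user: explain attention simply"]
tool_call = '{"tool": "search_docs", "arguments": {"query": "self-attention"}}'
context = handle_model_output(tool_call, context)
context = handle_model_output("Attention lets each word look at the others.", context)
print("\n".join(context))
```

Keeping an explicit registry means the model can only request tools the host has chosen to expose; anything else falls through as plain text.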

Review & answer

render

For multi-step work, a planner → worker → reviewer loop iterates, and a human approves anything consequential before it ships. Then the finished answer streams to your screen, token by token.

delivered: “Attention lets every word look at the others and decide which ones matter.”
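The planner → worker → reviewer loop with a human gate can be sketched as a plain control loop. This is one possible arrangement under stated assumptions: the plan is already a list of steps, and `Step`, `run_loop`, and the `act`, `review`, and `approve` callables are hypothetical names standing in for model calls and a real approval interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    consequential: bool = False  # True means a human must approve first


def run_loop(steps: list[Step],
             act: Callable[[Step, int], str],
             review: Callable[[str], bool],
             approve: Callable[[Step], bool],
             max_attempts: int = 3) -> list[str]:
    """Do each step, retry until the reviewer accepts, gate risky steps."""
    log: list[str] = []
    for step in steps:
        # Human gate: consequential steps never run without approval.
        if step.consequential and not approve(step):
            log.append(f"{step.name}: blocked by human gate")
            continue
        for attempt in range(1, max_attempts + 1):
            result = act(step, attempt)
            if review(result):
                log.append(f"{step.name}: ok after {attempt} attempt(s)")
                break
        else:
            # The retry budget is what keeps the loop from running forever.
            log.append(f"{step.name}: gave up after {max_attempts} attempts")
    return log


plan = [Step("draft summary"), Step("send email", consequential=True)]
log = run_loop(
    plan,
    act=lambda step, attempt: f"{step.name} v{attempt}",   # stand-in worker
    review=lambda result: result.endswith("v2"),            # passes on 2nd try
    approve=lambda step: False,                             # human says no
)
print("\n".join(log))
```

The two bounds in the sketch, the approval check and `max_attempts`, are what make this a controlled loop rather than an open-ended one.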

↻ The Agentic & Feedback Loop // every answer can loop, and every preference trains the next model

A model is not one-and-done either. Within a task it can loop until the work is right; across millions of tasks, your reactions quietly shape the next version. This is the engine behind plan → act → review and the preference → training ladder.

Plan → act → review

Agents break the task down, do a step, check the result, and loop until it holds.

Human gate

A person approves anything consequential before it ships – the loop's safety valve.

Stream to screen

The answer arrives live, token by token, instead of all at once.

Preference → RLHF

Your thumbs-up, edits, and retries can become preference data for training, so the next model gets better.

The Four-Layer Stack · what runs underneath

LLM: the model · transformer weights

RAG: retrieval-augmented grounding

MCP: tool-calling protocol

Agents: plan · act · review orchestration

Vector DB: semantic retrieval store

Context Window: the finite input budget

Algorithms in play

BPE Tokenizer: text → subword tokens

Embeddings: tokens → vectors

Self-Attention: context, computed

Next-Token Prediction: autoregressive generation

ANN Search: nearest-vector retrieval

KV-Cache: fast repeated decoding

That is the machine itself: four layers, one lifecycle, and a single new move – generation – that search never had. None of it stays magic once you can point at each stage and run it.

// carry forward

The machine is demystified. The harder question is what you are allowed to point it at: who stays in the loop, how to scope an agent so it cannot wander, and the seven things it cannot run without. That is Chapter 5.