Inside the Language Model –
From Prompt to Answer

In Article 03, the search engine you already knew turned out to be the ancestor of nearly everything in modern AI. The vocabulary changed in 2017; the instincts did not. This is the other half: the machine itself. A prompt goes in as text, an answer comes out, and in between sit four layers and a lifecycle you can watch run, stage by stage, right here.

// in one breath
  • The four layers almost every AI deployment in 2026 sits inside – LLM, RAG, MCP, agents – and the one thing each one adds.
  • The AI lifecycle from prompt to answer: the same shape as search, with one genuinely new move – and every algorithm in it playable inline, from tokenisation to generation.
  • Where the machine stops being magic and turns into arithmetic – and what running it against real systems then demands, in Article 05.

You have already met the ancestors. Tokenisation grew out of query tokenisation, vector retrieval out of the inverted index, attention out of relevance ranking. What follows is what the 2017 architecture built on top of them – and the one capability search never had.

the stack

The Four Layers Built on 2017

The transformer paper opened a capability gap. The industry spent the next six years filling it. The result is a four-layer stack that almost every AI deployment in 2026 sits inside. Each layer adds exactly one thing. Understand what each adds, and you understand the architecture. Treat any layer as a black box, and you cannot debug it when it fails – and it will fail.

01
LLM – Large Language Model
Adds: text → meaning → text
The reasoning engine. Takes text in; produces text out. Understands context, follows instructions, generates content. Claude, GPT-4o, Gemini 2.5, Llama 3 – all LLMs. The core capability from the 2017 architecture shift. Alone, it is powerful and stateless. It knows only what you put in the prompt.
02
RAG – Retrieval-Augmented Generation
Adds: static knowledge → live retrieval
The memory layer. A standalone LLM only knows what it was trained on. RAG adds real-time retrieval: when a prompt arrives, relevant documents are fetched from a knowledge base – your codebase, your docs, your database – and injected into context. The model responds with current, specific information instead of generalised training data.
03
MCP – Model Context Protocol
Adds: isolated → tool-connected
The integration layer. Connects the LLM to external tools and systems – databases, APIs, file systems, version control, communication platforms. Without MCP (or an equivalent tool-calling layer), the model can only generate text about the world. With it, the model can act on the world. That distinction is the line between a chatbot and an agent.
04
Agents – Autonomous Task Execution
Adds: single-call → multi-step orchestration
The execution layer. Instead of a single prompt-response cycle, agents operate across multiple steps – planning a task, calling tools, evaluating intermediate outputs, correcting course, and continuing until a defined goal is met. An agent is not a smarter chatbot. It is a controlled task-execution loop. In 2026 the tools are Claude Code, GitHub Copilot Workspace, Cursor, Amazon Q Developer, and Gemini Code Assist. The underlying model barely matters – the orchestration does.
the lifecycle

That four-layer stack is a static picture. Here it is in motion – the same shape as the search pipeline, from your prompt to the answer, with each component doing its one job. Where a stage is an algorithm you can run, open its try it live panel and step through it yourself.

Context
Tokenize
Embed
Retrieve · RAG
Attend
Generate
Feedback Loop
PHASE A Comprehend // prompt → tokens → vectors → grounded context
A1

Assemble the context window

the model's whole world

Your message is never seen alone. It is stitched together with the system instructions and the chat history into one context window – the entire, finite input the model gets. Nothing is looked up in a database of stored answers.

PART 1

System prompt

The standing rules: who the model is and what it may do.

set by · the app
PART 2

Chat history

Everything said so far in this conversation, in order.

scope · the session
PART 3

Your message

The new prompt, the thing you just typed.

scope · this turn
your message»“explain attention simply”
A2

Tokenize

engine: BPE tokenizer

A model never sees letters. The text is split into tokens – subword pieces from a fixed vocabulary. Common words stay whole; rarer ones split into reusable pieces, and each token becomes an integer ID.

tokensexplainattentionsimply ids 25, 6817, 9760, 88
Try it live: tokenisationtext becomes numbers
A3

Embed

engine: embedding table

Each token ID is looked up and becomes a vector – a list of numbers – so meaning turns into geometry: similar ideas land near each other (king near queen). The very same vector space that powers semantic search.

each token → a vector[0.12, -0.41, …][0.90, 0.08, …][-0.33, 0.71, …]
Try it live: embeddingsmeaning becomes geometry
A4

Retrieve context · RAG (if grounded)

engine: vector search

The prompt's vector is matched against a vector database to pull the most relevant documents – your docs, your code, your knowledge base – and inject them into the context. This is search's retrieval step, reborn inside the model's input.

GROUNDED Documents found

The nearest passages are injected into the context window, so the answer is anchored to real sources, not just memory.

PARAMETRIC No retrieval

With no knowledge base attached, the model answers from its trained parameters alone – fast, but ungrounded.

injectedtransformer-paper.md · attention-explained.md
CONTEXT + MEANING READY ↓ feeds the transformer
PHASE B Generate // attention → next-token loop → tools → answer
B1

Attend

engine: transformer · self-attention

A word means nothing alone. The model weighs every token against every other, all at once, to work out what refers to what and what matters. This is the 2017 transformer doing its work – relevance scoring, the search engineer's craft, turned inward on the sentence.

weightsattention looks most at explain and the retrieved transformer-paper
Try it live: attentioncontext, computed
B2

Generate, token by token

★ the one new step engine: autoregressive loop

Now the model writes. It produces a probability over every token in its vocabulary, picks one, appends it, and runs the whole thing again – building the answer one piece at a time. A very large, very capable autocomplete. This is the stage search never had: a search engine ranks pages that already exist; this loop generates text that did not.

streaming“Attention lets each word look at the others …”
Try it live: next-token generationthe one new move
B3

Act with tools · MCP (if needed)

protocol: MCP

When the task needs more than text, the model calls tools – query a database, hit an API, edit a file. This is the line between a chatbot and an agent.

NEEDS TOOL Call out

The model emits a structured tool call, runs it, and feeds the result back into the context – then keeps generating.

TEXT ONLY Answer directly

If the answer is just language, no tool is called – generation continues straight to the reply.

tool callsearch_docs("self-attention")3 results
B4

Review & answer

render

For multi-step work, a planner → worker → reviewer loop iterates, and a human approves anything consequential before it ships. Then the finished answer streams to your screen, token by token.

delivered“Attention lets every word look at the others and decide which ones matter.”

The Agentic & Feedback Loop  // every answer can loop, and every preference trains the next model

A model is not one-and-done either. Within a task it can loop until the work is right; across millions of tasks, your reactions quietly shape the next version. This is the engine behind plan → act → review and the preference → training ladder.

L1
Plan → act → review

Agents break the task down, do a step, check the result, and loop until it holds.

L2
Human gate

A person approves anything consequential before it ships – the loop's safety valve.

L3
Stream to screen

The answer arrives live, token by token, instead of all at once.

L4
Preference → RLHF

Your thumbs-up / edit / retry feed the training data, so the next model gets better.

The Four-Layer Stack · what runs underneath
LLMthe model · transformer weights
RAGretrieval-augmented grounding
MCPtool-calling protocol
Agentsplan · act · review orchestration
Vector DBsemantic retrieval store
Context Windowthe finite input budget
Algorithms in play
BPE Tokenizertext → subword tokens
Embeddingstokens → vectors
Self-Attentioncontext, computed
Next-Token Predictionautoregressive generation
ANN Searchnearest-vector retrieval
KV-Cachefast repeated decoding
That is the machine itself: four layers, one lifecycle, and a single new move – generation – that search never had. None of it stays magic once you can point at each stage and run it. The harder question is what you are allowed to point it at: who stays in the loop, how to scope an agent so it cannot wander, and the seven things it cannot run without. That is Article 05.