In Article 03, the search engine you already knew turned out to be the ancestor of nearly everything in modern AI. The vocabulary changed in 2017; the instincts did not. This is the other half: the machine itself. A prompt goes in as text, an answer comes out, and in between sit four layers and a lifecycle you can watch run, stage by stage, right here.
- The four layers almost every AI deployment in 2026 sits inside – LLM, RAG, MCP, agents – and the one thing each one adds.
- The AI lifecycle from prompt to answer: the same shape as search, with one genuinely new move – and every algorithm in it playable inline, from tokenisation to generation.
- Where the machine stops being magic and turns into arithmetic – and what running it against real systems then demands, in Article 05.
You have already met the ancestors. Tokenisation grew out of query tokenisation, vector retrieval out of the inverted index, attention out of relevance ranking. What follows is what the 2017 architecture built on top of them – and the one capability search never had.
The Four Layers Built on 2017
The transformer paper opened a capability gap. The industry spent the next six years filling it. The result is a four-layer stack that almost every AI deployment in 2026 sits inside. Each layer adds exactly one thing. Understand what each adds, and you understand the architecture. Treat any layer as a black box, and you cannot debug it when it fails – and it will fail.
That four-layer stack is a static picture. Here it is in motion – the same shape as the search pipeline, from your prompt to the answer, with each component doing its one job. Where a stage is an algorithm you can run, open its try it live panel and step through it yourself.
Assemble the context window
the model's whole worldYour message is never seen alone. It is stitched together with the system instructions and the chat history into one context window – the entire, finite input the model gets. Nothing is looked up in a database of stored answers.
System prompt
The standing rules: who the model is and what it may do.
set by · the appChat history
Everything said so far in this conversation, in order.
scope · the sessionYour message
The new prompt, the thing you just typed.
scope · this turnTokenize
engine: BPE tokenizerA model never sees letters. The text is split into tokens – subword pieces from a fixed vocabulary. Common words stay whole; rarer ones split into reusable pieces, and each token becomes an integer ID.
▶ Try it live: tokenisationtext becomes numbers
Embed
engine: embedding tableEach token ID is looked up and becomes a vector – a list of numbers – so meaning turns into geometry: similar ideas land near each other (king near queen). The very same vector space that powers semantic search.
▶ Try it live: embeddingsmeaning becomes geometry
Retrieve context · RAG (if grounded)
engine: vector searchThe prompt's vector is matched against a vector database to pull the most relevant documents – your docs, your code, your knowledge base – and inject them into the context. This is search's retrieval step, reborn inside the model's input.
The nearest passages are injected into the context window, so the answer is anchored to real sources, not just memory.
With no knowledge base attached, the model answers from its trained parameters alone – fast, but ungrounded.
Attend
engine: transformer · self-attentionA word means nothing alone. The model weighs every token against every other, all at once, to work out what refers to what and what matters. This is the 2017 transformer doing its work – relevance scoring, the search engineer's craft, turned inward on the sentence.
▶ Try it live: attentioncontext, computed
Generate, token by token
★ the one new step engine: autoregressive loopNow the model writes. It produces a probability over every token in its vocabulary, picks one, appends it, and runs the whole thing again – building the answer one piece at a time. A very large, very capable autocomplete. This is the stage search never had: a search engine ranks pages that already exist; this loop generates text that did not.
▶ Try it live: next-token generationthe one new move
Act with tools · MCP (if needed)
protocol: MCPWhen the task needs more than text, the model calls tools – query a database, hit an API, edit a file. This is the line between a chatbot and an agent.
The model emits a structured tool call, runs it, and feeds the result back into the context – then keeps generating.
If the answer is just language, no tool is called – generation continues straight to the reply.
Review & answer
renderFor multi-step work, a planner → worker → reviewer loop iterates, and a human approves anything consequential before it ships. Then the finished answer streams to your screen, token by token.
↻ The Agentic & Feedback Loop // every answer can loop, and every preference trains the next model
A model is not one-and-done either. Within a task it can loop until the work is right; across millions of tasks, your reactions quietly shape the next version. This is the engine behind plan → act → review and the preference → training ladder.
Plan → act → review
Agents break the task down, do a step, check the result, and loop until it holds.
Human gate
A person approves anything consequential before it ships – the loop's safety valve.
Stream to screen
The answer arrives live, token by token, instead of all at once.
Preference → RLHF
Your thumbs-up / edit / retry feed the training data, so the next model gets better.