Run AI Locally: The Complete Cookbook

The Series

Seven Sections. One Complete Picture.

Each section is a standalone guide. Together they take you from zero (you have never run a local model) to being the person on your team who actually understands how local AI works.

Section 01 · Learn.

Why Local AI? + Hardware Guide

Which model fits which machine. The hardware matrix every engineer needs before downloading anything.

Section 02 · Learn.

The Stack That Never Changes

Ollama + Open WebUI – the two tools that underpin every single use case in this series. Set up once, use forever.

Section 03 · Excel.

Your Brain in a Box – RAG Setup

Upload your documents, notes, and PDFs. Ask questions. Get answers that cite your own knowledge base.

Section 04 · Excel.

Your AI Pair Programmer

Set up a local coding assistant in VS Code. Review code, generate functions, debug errors – all offline.

Section 05 · Excel.

AI That Does Things – Agents

Connect your model to tools. File systems, APIs, web search. Agentic workflows that actually run tasks.

Section 06 · Lead.

Making It Yours – Fine-tuning

Train a model on your data. Your writing style. Your domain. Your brand voice. QLoRA on free GPU.

Section 07 · Lead.

From Hobbyist to AI Engineer

The career roadmap. What skills open which doors. How local AI expertise translates into real leverage.

Section 01 · Hardware Guide

Which Machine Can Run What

Five tiers. Every common hardware profile covered. Pick your tier and see exactly which models run, what use cases unlock, and what you can't do yet.
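The RAM figures quoted for each model follow from simple arithmetic: parameter count times bits per weight. Here is a rough sketch of that estimate, assuming the llama.cpp Q4_0 and Q8_0 layouts (4.5 and 8.5 bits per weight once block scales are counted) and a hypothetical 10% allowance for runtime buffers. The function names are illustrative, and real usage also grows with context length.

```python
# Bits per weight including per-block scales (llama.cpp Q4_0 / Q8_0 layouts).
BITS_PER_WEIGHT = {"q4": 4.5, "q8": 8.5, "f16": 16.0}


def estimate_model_gb(params_billion: float, quant: str = "q4",
                      overhead: float = 0.10) -> float:
    """Approximate memory in GB (10^9 bytes) needed to load the weights.

    `overhead` is an assumed allowance for runtime buffers. The KV cache
    for long contexts is NOT included and can add several GB.
    """
    weight_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(weight_bytes * (1 + overhead) / 1e9, 1)


def fits(params_billion: float, available_gb: float, quant: str = "q4") -> bool:
    """True if the estimated footprint fits in the memory you can spare."""
    return estimate_model_gb(params_billion, quant) <= available_gb
```

For example, `fits(14, 10)` asks whether a 14B model at Q4 fits in 10 GB of free memory. Treat the result as a first filter, not a guarantee.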

RAM

4 – 8 GB

System RAM, no GPU

Storage Free

50 GB+

For model files

Profile

T0 · Budget Laptop

2–4 GB available for LLM

Gemma 4 E2B

~1.5 GB RAM (Q8)

Best Fit · Multimodal

Qwen3:1.7B

~1.1 GB RAM (Q4)

Fastest

Phi-4-mini (3.8B)

~2.5 GB RAM (Q4)

Good Reasoning

Gemma 4 E4B

~3.5 GB RAM (Q4)

Multimodal

✅

Basic chat and Q&A – conversations, explanations, summaries at 2–4 tokens/sec

✅

Simple text generation – short articles, email drafts, basic code snippets

✅

Image understanding (E2B/E4B) – describe images, read charts, basic OCR

⚠️

Small RAG queries – possible with <20 short documents, slow ingestion

❌

Complex reasoning, multi-step coding, long documents – model too small, context too short

RAM

16 GB

System RAM, no GPU

Storage Free

100 GB+

For model files

Profile

T1 · Entry Laptop/Desktop

8–10 GB available for LLM

Phi 4 (14B)

~8 GB RAM (Q4)

Best Fit · Strong Reasoning

Qwen3:8B

~5 GB RAM (Q4)

Fast

Mistral 7B

~4.5 GB RAM (Q4)

Solid All-Round

Gemma 4 E4B

~3.5 GB RAM (Q4)

Vision

✅

Good coding assistance – Phi 4 rivals GPT-3.5 on code tasks at this size

✅

Document summaries and analysis – handles 10K–20K token documents comfortably

✅

Basic RAG pipeline – 50–100 documents, Open WebUI built-in RAG works well

⚠️

Complex multi-step agents – possible but slow, model may lose context on long chains

❌

70B+ models, vision on large images – not enough RAM for larger quantizations

RAM

32 GB

System RAM, no GPU

Storage Free

200 GB+

Multiple models

Profile

T2 · Workhorse (Your Setup)

18–22 GB available for LLM

Gemma 4 26B A4B

~18 GB RAM (Q4)

Best Fit · Vision + Agents

Qwen3:14B

~9 GB RAM (Q4)

Faster Option

Phi 4 (14B)

~8 GB RAM (Q4)

Reasoning

Mistral Small 3.1

~12 GB RAM (Q4)

Vision

✅

Advanced reasoning and complex Q&A – 26B A4B performs close to frontier models on reasoning benchmarks

✅

Full RAG pipeline – hundreds of documents, semantic search, cited answers from your knowledge base

✅

Vision tasks – Gemma 4 and Mistral Small 3.1 both handle image + text natively

✅

Agentic workflows with tool calling – Gemma 4 has native function calling, works with n8n and Open WebUI Pipelines

✅

256K token context – process entire codebases, long reports, multi-chapter documents in one pass

⚠️

Speed – CPU inference runs at 3–8 tokens/sec. Usable, not instant. Background tasks are fine; real-time chat feels slightly slow.

GPU VRAM

8 – 16 GB

RTX 3080/4070/4080

System RAM

32 GB+

GPU offloads to RAM if needed

Profile

T3 · Gaming PC / ML Workstation

Roughly 10× faster than CPU only

Gemma 4 31B

~20 GB VRAM (Q4)

Best Quality

Qwen3:30B

~18 GB VRAM (Q4)

Best Coding

Llama 4 Scout

~65 GB (Q4; 109B total parameters, needs 64 GB+ system RAM for offload)

Long Context

StarCoder2-15B

~9 GB VRAM (Q4)

Code-Specialist

✅

Everything in T2, roughly 10× faster – real-time chat speeds, 30–80 tokens/sec

✅

QLoRA fine-tuning – fine-tune 7B models locally, 13B possible on 16GB VRAM

✅

Multiple concurrent models – run Ollama with 2 models loaded simultaneously

✅

Production-grade inference – can serve as a real local API for small teams

GPU VRAM

24 GB+

RTX 4090 / A100 / H100

System RAM

64 GB+

For large model offload

Profile

T4 · Serious ML Setup / Cloud VM

Near-frontier local performance

Qwen2.5:72B

~42 GB (Q4, split GPU+RAM)

Best Quality

DeepSeek V3

671B MoE; requires a multi-GPU server

Frontier-Level

Llama 4 Scout

~65 GB (Q4, split GPU+RAM)

10M Context

DeepSeek Coder 33B

~20 GB VRAM (Q4)

Best Code Model

✅

Near-frontier quality – Qwen2.5:72B is competitive with GPT-4 on many public benchmarks

✅

Full fine-tuning of 7B–13B models – LoRA or full parameter training locally

✅

Production API serving – serve your own OpenAI-compatible endpoint for your team

✅

QLoRA of 30B–70B models – serious research and domain adaptation
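The five profiles above can be read as a simple lookup. Here is a minimal sketch, assuming the labels T0 to T4 used later in this series map onto the profiles in order, and that machines between two profiles (say 24 GB of RAM) fall into the lower one. The function name and those boundary choices are illustrative assumptions.

```python
from typing import Optional


def pick_tier(ram_gb: float, vram_gb: float = 0.0) -> Optional[str]:
    """Map a machine to a tier; returns None below the T0 floor."""
    if vram_gb >= 24:
        return "T4"  # Serious ML setup / cloud VM
    if vram_gb >= 8:
        return "T3"  # Gaming PC / ML workstation
    if ram_gb >= 32:
        return "T2"  # Workhorse, CPU only
    if ram_gb >= 16:
        return "T1"  # Entry laptop/desktop
    if ram_gb >= 4:
        return "T0"  # Budget laptop
    return None
```

A GPU takes precedence over system RAM here because a model that fits in VRAM runs far faster than the same model on CPU.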

Section 02 · The Stack

The Setup That Never Changes.

Every use case in this series builds on the same two tools. Install this once. Every playbook below assumes you have these running.

01

Install Ollama – Your Model Server

Ollama downloads and runs models locally. It listens on localhost:11434 and exposes both its native REST API (under /api) and an OpenAI-compatible API (under /v1). Every other tool connects to it.

bash

# Linux / Mac
curl -fsSL https://ollama.com/install.sh | sh

# Windows → download installer from ollama.com

# Pull the T2 recommendations (each is a multi-GB download; start with one)
ollama pull gemma4:26b-a4b
ollama pull phi4
ollama pull qwen3:14b

# List available models
ollama list
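Once a model is pulled, any program can talk to it over HTTP. Here is a minimal sketch of a chat call against the OpenAI-compatible endpoint using only the Python standard library. The helper names are illustrative, and `phi4` is assumed to be pulled already.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address


def build_chat_request(model: str, user_message: str,
                       system_prompt: str = "") -> urllib.request.Request:
    """Build a POST request for Ollama's OpenAI-compatible chat endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str, system_prompt: str = "") -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_message, system_prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With Ollama running, `chat("phi4", "Explain GGUF in one sentence.")` returns the reply as a string. The generous timeout is deliberate, since CPU inference can take a while.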

02

Launch Open WebUI – Your Chat Interface

A full ChatGPT-like interface that connects to Ollama. Runs in your browser. Supports RAG, image upload, conversation history, user management, and more.

bash (Docker)

docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Open browser at:
# http://localhost:3000

💡

No Docker? Use the Python install: pip install open-webui && open-webui serve. This serves the UI at http://localhost:8080 instead of port 3000.

03

Enable RAG – Your Knowledge Base

Open WebUI has built-in RAG. No extra tools needed for basic use. Upload documents, then reference them in chat.

open webui steps

# In the Open WebUI sidebar:
1. Go to Workspace → Documents
2. Upload PDFs, DOCX, TXT, MD files
3. In chat, type # to reference a document
4. Or enable "RAG Mode" to always search docs

# Advanced: AnythingLLM for full local RAG
docker run -d -p 3001:3001 \
  -v anythingllm:/app/server/storage \
  --name anythingllm \
  mintplexlabs/anythingllm

04

Optional: Add n8n for Agentic Workflows

When you want your AI to take actions – send emails, write files, search the web, call APIs – connect n8n to Ollama.

bash

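docker run -d -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  --add-host=host.docker.internal:host-gateway \
  --name n8n \
  n8nio/n8n

# Open n8n at http://localhost:5678
# Add Ollama node → set base URL to http://host.docker.internal:11434
# (inside the container, localhost is the container itself, not your machine)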

Your Complete Stack

Your Chat Interface

Open WebUI – localhost:3000

↕

Your Knowledge Base

Open WebUI RAG / AnythingLLM

↕

Your Model Server

Ollama – localhost:11434

↕

Your Model Files

GGUF quantized models · Local disk

Cost

€0 / month

All tools are free to self-host (most are open source; n8n is fair-code licensed)

Data

100% Local

Nothing leaves your machine

⚠️

Windows users: Install Docker Desktop first and enable WSL2 integration. Run the Docker commands above from a WSL2 shell; in PowerShell, replace each trailing \ with a backtick (`) or put the command on one line.

Sections 03–05 · Use Cases

Seven Playbooks. One Stack.

Same foundation every time: Ollama + Open WebUI. What changes is the model you choose, the documents you upload, and how you wire the tools together.

💬

Personal AI Assistant

Daily chat, summaries, Q&A, writing

Model

Phi 4 or Qwen3:14B

Min Tier

T1 (16GB RAM)

Extra

None

1

Install Ollama + pull phi4

2

Open WebUI → set system prompt: your name, role, preferences

3

Enable conversation memory in Settings → General

4

Create a saved prompt library for repeated tasks

👨‍💻

Coding Assistant

Code completion, review, debugging in VS Code

Model

Gemma 4 26B A4B or Phi 4

Min Tier

T1 (16GB, use Phi 4)

Extra

Continue.dev VS Code extension

1

Install VS Code extension: Continue.dev

2

In Continue config: set provider to ollama, model to phi4

3

Use Tab to accept inline completions, Ctrl+L to open chat, and Ctrl+I to edit the selected code inline

4

Add project docs to Continue context for codebase-aware help

🧠

Personal Knowledge Base

RAG over your own notes, docs, books

Model

Gemma 4 26B A4B or Qwen3:14B

Min Tier

T1 (16GB)

Extra

Open WebUI RAG (built-in)

1

Open WebUI → Workspace → Documents → Upload files

2

Settings → Documents → Set chunk size 500, overlap 50

3

In chat, type #filename to query a specific doc

4

Enable RAG mode for all chats to always search your docs
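The chunk size and overlap in step 2 control how documents are cut before they are indexed. Here is a minimal sketch of fixed-size chunking with overlap, assuming sizes are measured in characters. Open WebUI's own splitter is more sophisticated and may count differently.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window repeats the last `overlap` characters of the previous one,
    so text near a boundary appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Smaller chunks give more precise matches but less surrounding context per hit, so 500/50 is a reasonable starting point rather than a rule.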

📄

Document Analysis

Summarise, extract, compare large documents

Model

Gemma 4 26B A4B (256K context)

Min Tier

T2 (32GB RAM)

Extra

Open WebUI built-in PDF reader

1

Pull gemma4:26b-a4b – 256K context handles full reports

2

Upload PDF directly in the chat window (paperclip icon)

3

Prompt: "Summarise section 3" / "Extract all dates and deadlines"

4

Compare two docs: upload both, ask for differences
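Before uploading a long report, it helps to check whether it plausibly fits the model's context window. Here is a rough sketch, assuming about 4 characters per token (a common rule of thumb for English text, not the model's real tokenizer) and a hypothetical reserve for the prompt and the answer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_context(text: str, context_tokens: int = 256_000,
                 reserve_tokens: int = 4_000) -> bool:
    """True if the document leaves `reserve_tokens` free for the
    instructions and the model's answer."""
    return estimate_tokens(text) + reserve_tokens <= context_tokens
```

If a document does not fit, split it by section and summarise each part, or fall back to RAG.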

🖼️

Vision + Multi-modal

Image analysis, chart reading, screenshot Q&A

Model

Gemma 4 26B A4B or Mistral Small 3.1

Min Tier

T0 (Gemma 4 E2B for basic)

Extra

None – built into model

1

Pull a vision model: gemma4:26b-a4b or mistral-small3.1

2

In Open WebUI chat, click the image icon to attach a photo

3

Ask: "What does this architecture diagram show?" or "Read the text in this screenshot"

4

Use for: reading charts, UI screenshots, whiteboard photos, receipts
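The same image questions can be scripted against Ollama directly. Here is a minimal sketch that builds the request body for the native /api/chat endpoint, where images travel as base64 strings in the message's `images` list. The helper name is illustrative.

```python
import base64
import json


def build_vision_payload(model: str, question: str, image_bytes: bytes) -> str:
    """JSON body for Ollama's /api/chat with one attached image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "stream": False,  # return one JSON object instead of a token stream
        "messages": [
            {"role": "user", "content": question, "images": [encoded]},
        ],
    })
```

Read the image with `Path("diagram.png").read_bytes()` (a placeholder file name), POST the payload to http://localhost:11434/api/chat, and read the reply from `message.content` in the response.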

🔍

Research Assistant

Web search + local docs + synthesis

Model

Gemma 4 26B A4B

Min Tier

T2 (32GB)

Extra

Open WebUI + SearXNG (web)

1

Run SearXNG: docker run -d -p 8080:8080 searxng/searxng

2

Open WebUI → Settings → Web Search → enable, point to SearXNG (SearXNG must allow the json output format in its settings.yml)

3

Toggle web search icon in chat for live web queries

4

Combine with RAG: web results + your own notes in one answer
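To see what the model receives from web search, you can query SearXNG yourself. Here is a minimal sketch, assuming the instance from step 1 on port 8080 with `json` enabled under `search.formats` in its settings.yml (otherwise SearXNG rejects the request). The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(query: str, base_url: str = "http://localhost:8080") -> str:
    """URL for SearXNG's JSON search API."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"


def search(query: str, max_results: int = 5) -> list[dict]:
    """Fetch results and keep only title, url and snippet."""
    with urllib.request.urlopen(build_search_url(query), timeout=30) as response:
        results = json.load(response).get("results", [])
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content")}
        for r in results[:max_results]
    ]
```

The trimmed result list is what you would paste into a prompt next to your own notes.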

⚙️

Agentic Workflows

AI that takes actions across tools and APIs

Model

Gemma 4 26B A4B (native function calling)

Min Tier

T2 (32GB)

Extra

n8n or Open WebUI Pipelines

1

Start n8n: docker run -d -p 5678:5678 n8nio/n8n

2

Create workflow: Trigger → Ollama AI Agent → Action node

3

Add tools: file system, HTTP request, email, calendar

4

Test: "Summarise my latest emails and save to a file"
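An agent workflow boils down to a loop: the model proposes a tool call, your code runs it, and the result goes back to the model. Here is a minimal sketch of the dispatch half of that loop, assuming each tool call arrives as a function name plus a dict of arguments (the shape Ollama's chat API uses under `message.tool_calls`). The two tools are hypothetical examples, not part of n8n.

```python
from pathlib import Path
from typing import Any, Callable


def word_count(text: str) -> int:
    """Hypothetical tool: count whitespace-separated words."""
    return len(text.split())


def save_note(path: str, text: str) -> str:
    """Hypothetical tool: write text to a file and report where it went."""
    Path(path).write_text(text, encoding="utf-8")
    return f"saved {len(text)} characters to {path}"


TOOLS: dict[str, Callable[..., Any]] = {
    "word_count": word_count,
    "save_note": save_note,
}


def dispatch(tool_call: dict) -> str:
    """Run one tool call and return its result as text for the model.

    Unknown tools and bad arguments are reported back rather than raised,
    so the model gets a chance to correct itself on the next turn.
    """
    function = tool_call.get("function", {})
    name, arguments = function.get("name"), function.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return str(TOOLS[name](**arguments))
    except TypeError as exc:
        return f"error: bad arguments for {name}: {exc}"
```

Keeping the tool registry explicit means the model can only trigger actions you have deliberately exposed.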

Use Case → Minimum Tier Reference

Use Case	Min Tier	Recommended Model	Speed on T2	Extra Tool?
Personal Chat	T1	Phi 4	4–6 t/s	None
Coding (VS Code)	T1	Phi 4 / Gemma 4 E4B	4–8 t/s	Continue.dev
Knowledge Base / RAG	T1	Phi 4	3–5 t/s	Built-in WebUI
Document Analysis	T2	Gemma 4 26B A4B	3–5 t/s	None
Vision / Images	T0	Gemma 4 E2B+	2–4 t/s	None
Research + Web	T2	Gemma 4 26B A4B	3–5 t/s	SearXNG
Agentic Workflows	T2	Gemma 4 26B A4B	3–5 t/s	n8n

Section 06 · Fine-tuning

Make the Model Sound Like You.

A general model knows everything broadly but nothing specifically about your domain, your writing style, or your audience. Fine-tuning teaches it. Here's the complete picture – tools, techniques, process, and what hardware you actually need.

Choose Your Technique

System Prompt

No GPU · Works now

Inject your persona, style, rules into every conversation. No training required. Limited to what fits in context.

RAG

T1+ · Works now

Teach by retrieval. Your docs become the model's memory at query time. No training, always up-to-date.

QLoRA ⭐ Recommended

8GB VRAM min (or free Colab)

Efficient fine-tuning. Loads the frozen base model in 4-bit and trains small low-rank adapter matrices on top. About 60% less memory than a full fine-tune. Best ROI.

LoRA

8–16 GB VRAM

Same adapter approach as QLoRA, but the frozen base model stays in 16-bit precision instead of 4-bit. Needs more memory; marginally better quality for the same dataset.

Full Fine-tuning

4× model size in VRAM

Updates every weight. Maximum flexibility, maximum hardware requirement. For serious domain adaptation at scale.
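LoRA and QLoRA are cheap because of how few parameters they train. Here is a worked sketch of the arithmetic, assuming a hypothetical 7B-class shape (hidden size 4096, 32 layers) and square projection matrices. Real models often use smaller key/value projections, so treat the result as an order of magnitude.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_total(hidden: int, layers: int, rank: int,
               matrices_per_layer: int = 2) -> int:
    """Total trainable parameters when adapting square hidden x hidden
    projections (for example q_proj and v_proj) in every layer."""
    return layers * matrices_per_layer * lora_trainable_params(hidden, hidden, rank)


# Hypothetical 7B-class shape: hidden size 4096, 32 layers, rank 16.
total = lora_total(hidden=4096, layers=32, rank=16)
fraction = total / 7e9  # share of the base model's weights being trained
```

Under these assumptions the adapters hold about 8.4 million parameters, roughly a tenth of a percent of the base model, which is why gradients and optimizer state fit on a small GPU.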

The Fine-tuning Process

Step 01

Collect Data

200–1000 examples min. Your writing, your Q&As, your domain content.

Step 02

Format

Convert to JSONL. ChatML or Alpaca format. instruction / input / output pairs.

Step 03

Choose Base

Pick your base model. 7B for speed, 14B for quality. Llama/Qwen/Gemma all work.

Step 04

Configure

Set LoRA rank (8–64), alpha, learning rate (2e-4). Use Unsloth defaults first.

Step 05

Train

On free Colab T4 (16GB) or local GPU. 1–3 hours for 7B model on 1000 examples.

Step 06

Evaluate

Test on held-out examples. Check perplexity drop. Does it sound right?

Step 07

Export

Export to GGUF format (Q4 or Q8). Merge LoRA adapter into base model.

Step 08

Deploy

Load GGUF into Ollama. Serve via Open WebUI. Your model, running locally.
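The perplexity check in Step 06 is the exponential of the average negative log-likelihood the model assigns to held-out tokens. Here is a minimal sketch, assuming you already have per-token natural-log probabilities from your evaluation run. How you collect them depends on your training framework and is outside this sketch.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```

Compare the base model and the fine-tuned model on the same held-out examples. A lower value after training means the model finds your data less surprising, which is necessary but not sufficient for "sounds right".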

Fine-tuning Tools

Unsloth ⭐

Start Here

2× faster training, 60% less memory. Supports Llama, Qwen, Gemma, Phi, Mistral. Free tier on Google Colab. Best for beginners.

LLaMA Factory

Web UI

GUI-based fine-tuning. No code required. Good for non-technical teams. Supports most popular model families.

Axolotl

Config-Driven

YAML config files. Very flexible. Supports many dataset formats and techniques. Best for custom training loops.

HF PEFT

Full Control

Hugging Face's Parameter-Efficient Fine-Tuning library. Maximum code control. Requires most Python knowledge.

Dataset Format (ChatML / JSONL)

jsonl – one example per line

{"messages": [{"role": "system", "content": "You are a senior cloud architect who explains complex systems to junior engineers in plain Urdu-influenced English."}, {"role": "user", "content": "Explain microservices in simple terms."}, {"role": "assistant", "content": "Think of it like a dhaba kitchen..."}]}

Each training example is one complete JSON object on a single line, as above (200+ recommended). JSONL has no comment syntax, so keep notes out of the data file.
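If your examples start out in Alpaca form (instruction / input / output), they can be converted to the messages layout mechanically. Here is a minimal sketch. The function names are illustrative, and the optional system prompt is an assumption about how you want to frame each example.

```python
import json


def alpaca_to_chatml(example: dict, system_prompt: str = "") -> dict:
    """Convert one Alpaca record (instruction / input / output) to a
    ChatML-style record with a `messages` list."""
    user_text = example["instruction"]
    if example.get("input"):
        user_text += "\n\n" + example["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}


def to_jsonl(examples: list[dict]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples) + "\n"
```

`json.dumps` escapes newlines inside the text, so each record stays on one physical line even when the content spans paragraphs.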

python – Unsloth QLoRA quick start

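from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit (QLoRA) and attach LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-4",
    max_seq_length = 2048,
    load_in_4bit = True,   # QLoRA
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)

# Load your JSONL dataset (train.jsonl is a placeholder path) and render each
# "messages" list into one training string with the model's chat template
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# Train. Settings other than the learning rate are starting-point assumptions,
# and argument names can differ between TRL versions.
trainer = SFTTrainer(model=model, tokenizer=tokenizer,
                     train_dataset=dataset,
                     args=SFTConfig(output_dir="outputs",
                                    dataset_text_field="text",
                                    learning_rate=2e-4,
                                    num_train_epochs=1))
trainer.train()

# Export to GGUF for Ollama
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q4_k_m")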
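To load the exported GGUF into Ollama, you describe it in a Modelfile. Here is a minimal sketch that renders one, using the `FROM`, `PARAMETER`, and `SYSTEM` instructions. The GGUF path, the default temperature, and the function name are placeholders, so check the actual file name the export step wrote.

```python
def render_modelfile(gguf_path: str, system_prompt: str,
                     temperature: float = 0.7) -> str:
    """Return the text of an Ollama Modelfile that points at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",                      # local weights to import
        f"PARAMETER temperature {temperature}",   # default sampling temperature
        f'SYSTEM """{system_prompt}"""',          # baked-in system prompt
    ]
    return "\n".join(lines) + "\n"
```

Save the result as `Modelfile`, then run `ollama create my-model -f Modelfile` followed by `ollama run my-model`. The model then appears in Open WebUI's model list like any other.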

💡

No local GPU? Google Colab's free T4 (16GB VRAM) is enough for QLoRA on 7B–14B models. Kaggle offers a P100 (16GB) with 30 hours/week free. Both work with Unsloth out of the box.

Section 07 · Career Roadmap

What You Can Build. When You Can Build It.

Local AI is a leverage skill. Every stage below unlocks a new tier of things you can build, automate, or offer. The timeline is realistic: each step takes hours to weeks, and the whole path takes months, not years.

◈

Learn.

Month 1 – The Foundation

Week 1 · Install Ollama, run your first model, have a conversation

Week 2 · Set up Open WebUI, create a custom system prompt, explore 3 models

Week 3 · Upload 10 documents, build your first RAG knowledge base

Week 4 · Compare Phi 4 vs Gemma 4 vs Qwen3 on the same prompt set

Unlocked: Local chat assistant · Personal knowledge base · Basic document analysis · Offline AI with no API bills

◆

Excel.

Months 2–3 – Applied Skills

Month 2 · Set up Continue.dev in VS Code, code with local AI daily

Month 2 · Build a RAG pipeline over a real codebase or research paper set

Month 3 · First agentic workflow in n8n – automate a real task you do weekly

Month 3 · Add web search to your research assistant via SearXNG

Unlocked: AI pair programmer · Codebase-aware assistant · Automated research pipeline · Agentic task automation

◉

Lead.

Months 4–6 – Expert Practitioner

Month 4 · Curate your first fine-tuning dataset (200+ examples from your writing)

Month 5 · QLoRA fine-tune a 7B model on Colab, export to GGUF, load in Ollama

Month 5 · Build a domain-specific chatbot and deploy it for a real user

Month 6 · Run your own OpenAI-compatible local API endpoint for your team

Unlocked: Custom domain model · Team AI infrastructure · Fine-tuning consultancy · Local AI deployment for clients

What Month-6 Engineers Can Build and Offer

What You Can Build	Who Needs It	Why It Matters
Company Knowledge Chatbot	Any business with internal docs	Replace hours of searching with seconds of querying
Private Code Review Bot	Dev teams with compliance constraints	No proprietary code sent to OpenAI
Domain-tuned Writing Assistant	Law firms, consultancies, media	Model that speaks in their exact voice and terminology
Local Customer Support AI	SMBs that can't afford cloud AI at scale	Cost: €0/month. Quality: better than GPT-3.5 on their domain
Research Synthesis Pipeline	Researchers, analysts, journalists	Ingest 100 PDFs, extract structured insights automatically
Agentic Task Automation	Operations teams, founders	AI that monitors, decides, and acts – not just chats

◈

The real leverage: Most engineers know how to use ChatGPT. Very few know how to run, customise, and deploy their own models. That gap is where expertise lives – and where consulting, tooling, and product opportunities sit. This skill set is not mainstream yet. It will be.

Open Source

Every Tool. Every Repo.

The whole stack is free and runs on your machine. Here is every project this series is built on, with its source. Star the ones you rely on; that is how open source stays alive.

Ollama ↗

The model server. Pulls and runs local models behind an OpenAI-compatible API at localhost:11434.

github.com/ollama/ollama

Open WebUI ↗

The chat interface. A self-hosted, offline ChatGPT-style front end with built-in RAG.

github.com/open-webui/open-webui

n8n ↗

The agent layer. Fair-code workflow automation that wires your model to tools, APIs and triggers.

github.com/n8n-io/n8n

AnythingLLM ↗

An all-in-one local knowledge base and agent app, an alternative to Open WebUI's built-in RAG.

github.com/Mintplex-Labs/anything-llm

SearXNG ↗

A private metasearch engine that gives your model live, untracked web results.

github.com/searxng/searxng

Continue ↗

The in-editor coding agent. Your local model as a pair programmer in VS Code or JetBrains.

github.com/continuedev/continue

Unsloth ↗

Fast, low-memory fine-tuning. Train a model on your own data on a free GPU.

github.com/unslothai/unsloth

◈

The models themselves are open weights. You pull them through Ollama's library at ollama.com/library, where Gemma, Qwen, Phi, Mistral, Llama and DeepSeek all live.