From Engineer to AI Engineer
Local AI Series · 7 Sections
Run AI Locally.
Your Machine. Your Data. Your Rules.

The complete cookbook for running open-source AI models on your own hardware – from a budget laptop to a workstation. No cloud bills. No data leaving your machine. Real engineers. Real setups. No fluff.

7 Sections · Open Source Only · Fine-tuning Included · Free to Run
The Series
Seven Sections. One Complete Picture.

Each section is a standalone guide. Together they take you from zero – never having run a local model – to being the person on your team who actually understands how local AI works.

Section 01 · Learn.
Why Local AI? + Hardware Guide
Which model fits which machine. The hardware matrix every engineer needs before downloading anything.
Section 02 · Learn.
The Stack That Never Changes
Ollama + Open WebUI – the two tools that underpin every single use case in this series. Set up once, use forever.
Section 03 · Excel.
Your Brain in a Box – RAG Setup
Upload your documents, notes, and PDFs. Ask questions. Get answers that cite your own knowledge base.
Section 04 · Excel.
Your AI Pair Programmer
Set up a local coding assistant in VS Code. Review code, generate functions, debug errors – all offline.
Section 05 · Excel.
AI That Does Things – Agents
Connect your model to tools. File systems, APIs, web search. Agentic workflows that actually run tasks.
Section 06 · Lead.
Making It Yours – Fine-tuning
Train a model on your data. Your writing style. Your domain. Your brand voice. QLoRA on free GPU.
Section 07 · Lead.
From Hobbyist to AI Engineer
The career roadmap. What skills open which doors. How local AI expertise translates into real leverage.
Section 01 · Hardware Guide
Which Machine Can Run What

Five tiers, T0 through T4. Every common hardware profile covered. Pick your tier and see exactly which models run, what use cases unlock, and what you can't do yet.

RAM
4 – 8 GB
System RAM, no GPU
Storage Free
50 GB+
For model files
Profile
T0 · Budget Laptop
2–4 GB available for LLM
Gemma 4 E2B
~1.5 GB RAM (Q8)
Best Fit · Multimodal
Qwen3:1.7B
~1.1 GB RAM (Q4)
Fastest
Phi-4-mini (3.8B)
~2.5 GB RAM (Q4)
Good Reasoning
Gemma 4 E4B
~3.5 GB RAM (Q4)
Multimodal
Basic chat and Q&A – conversations, explanations, summaries at 2–4 tokens/sec
Simple text generation – short articles, email drafts, basic code snippets
Image understanding (E2B/E4B) – describe images, read charts, basic OCR
⚠️ Small RAG queries – possible with <20 short documents, slow ingestion
Complex reasoning, multi-step coding, long documents – model too small, context too short
RAM
16 GB
System RAM, no GPU
Storage Free
100 GB+
For model files
Profile
T1 · Entry Laptop/Desktop
8–10 GB available for LLM
Phi 4 (14B)
~8 GB RAM (Q4)
Best Fit · Strong Reasoning
Qwen3:8B
~5 GB RAM (Q4)
Fast
Mistral 7B
~4.5 GB RAM (Q4)
Solid All-Round
Gemma 4 E4B
~3.5 GB RAM (Q4)
Vision
Good coding assistance – Phi 4 rivals GPT-3.5 on code tasks at this size
Document summaries and analysis – handles 10K–20K token documents comfortably
Basic RAG pipeline – 50–100 documents, Open WebUI built-in RAG works well
⚠️ Complex multi-step agents – possible but slow, model may lose context on long chains
70B+ models, vision on large images – not enough RAM for larger quantizations
RAM
32 GB
System RAM, no GPU
Storage Free
200 GB+
Multiple models
Profile
T2 · Workhorse (Your Setup)
18–22 GB available for LLM
Gemma 4 26B A4B
~18 GB RAM (Q4)
Best Fit · Vision + Agents
Qwen3:14B
~9 GB RAM (Q4)
Faster Option
Phi 4 (14B)
~8 GB RAM (Q4)
Reasoning
Mistral Small 3.1
~12 GB RAM (Q4)
Vision
Advanced reasoning and complex Q&A – 26B A4B performs close to frontier models on reasoning benchmarks
Full RAG pipeline – hundreds of documents, semantic search, cited answers from your knowledge base
Vision tasks – Gemma 4 and Mistral Small 3.1 both handle image + text natively
Agentic workflows with tool calling – Gemma 4 has native function calling, works with n8n and Open WebUI Pipelines
256K token context – process entire codebases, long reports, multi-chapter documents in one pass
⚠️ Speed – CPU inference runs at 3–8 tokens/sec. Usable, not instant. Background tasks are fine; real-time chat feels slightly slow.
GPU VRAM
8 – 16 GB
RTX 3080/4070/4080
System RAM
32 GB+
GPU offloads to RAM if needed
Profile
T3 · Gaming PC / ML Workstation
20–50× faster than CPU only
Gemma 4 31B
~20 GB VRAM (Q4)
Best Quality
Qwen3:30B
~18 GB VRAM (Q4)
Best Coding
Llama 4 Scout
~60 GB+ (Q4; 109B total parameters, so it needs heavy RAM offload on this tier)
Long Context
StarCoder2-15B
~9 GB VRAM (Q4)
Code-Specialist
Everything in T2, 20–50× faster – real-time chat speeds, 30–80 tokens/sec
QLoRA fine-tuning – fine-tune 7B models locally, 13B possible on 16GB VRAM
Multiple concurrent models – run Ollama with 2 models loaded simultaneously
Production-grade inference – can serve as a real local API for small teams
GPU VRAM
24 GB+
RTX 4090 / A100 / H100
System RAM
64 GB+
For large model offload
Profile
T4 · Serious ML Setup / Cloud VM
Near-frontier local performance
Qwen2.5:72B
~42 GB (Q4, split GPU+RAM)
Best Quality
DeepSeek V3
Requires multi-GPU or MoE
Frontier-Level
Llama 4 Scout Full
~60 GB+ (Q4, split GPU+RAM)
10M Context
StarCoder2-33B
~20 GB VRAM (Q4)
Best Code Model
Near-frontier quality – Qwen2.5:72B is competitive with GPT-4-class models on many public benchmarks
16-bit LoRA on 7B models – fits on a 24 GB card; 13B at 16-bit and full-parameter training need A100/H100-class memory
Production API serving – serve your own OpenAI-compatible endpoint for your team
QLoRA of 30B–70B models – serious research and domain adaptation
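The memory figures in the tiers above follow from simple arithmetic: parameter count times bits per weight, plus some headroom for the runtime and KV cache. Here is a rough sketch of that calculation. The 4.5 bits per weight for Q4-style quantization and the flat 1 GB overhead are illustrative assumptions, not measured values, so treat the output as a ballpark.

```python
def estimate_model_gb(params_billion: float, bits_per_weight: float = 4.5,
                      overhead_gb: float = 1.0) -> float:
    """Rough memory footprint in GB for a quantized model.

    bits_per_weight and overhead_gb are assumed defaults, not measurements.
    """
    # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to this:
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


def fits(params_billion: float, available_gb: float, **kwargs) -> bool:
    """True if the estimated footprint fits in the memory you can spare."""
    return estimate_model_gb(params_billion, **kwargs) <= available_gb


for name, size in [("7B", 7), ("14B", 14), ("30B", 30)]:
    print(f"{name}: ~{estimate_model_gb(size)} GB at Q4")
```

Compare the estimate against the "available for LLM" figure for your tier rather than total RAM, since the operating system and other apps need their share.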
Section 02 · The Stack
The Setup That Never Changes.

Every use case in this series builds on the same two tools. Install this once. Every playbook below assumes you have these running.

01
Install Ollama – Your Model Server
Ollama downloads and runs models locally. It serves its native REST API at localhost:11434 and an OpenAI-compatible API under localhost:11434/v1. Every other tool connects to it.
bash
# Linux / Mac
curl -fsSL https://ollama.com/install.sh | sh

# Windows → download installer from ollama.com

# Pull your first model (T2 recommendation)
ollama pull gemma4:26b-a4b
ollama pull phi4
ollama pull qwen3:14b

# List available models
ollama list
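Once Ollama is running, any script can talk to it over HTTP. The sketch below builds a request for Ollama's native /api/chat endpoint using only the standard library. The helper names (build_chat_request, ask) are made up for this example, and the call itself is left commented out because it needs a running server with the model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address from the text


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.loads(resp.read())["message"]["content"]


# Example, with Ollama running and phi4 pulled:
# print(ask("phi4", "Explain quantization in one sentence."))
```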
02
Launch Open WebUI – Your Chat Interface
A full ChatGPT-like interface that connects to Ollama. Runs in your browser. Supports RAG, image upload, conversation history, user management, and more.
bash (Docker)
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Open browser at:
# http://localhost:3000
💡 No Docker? Use the Python install: pip install open-webui && open-webui serve (this serves on http://localhost:8080 by default)
03
Enable RAG – Your Knowledge Base
Open WebUI has built-in RAG. No extra tools needed for basic use. Upload documents, then reference them in chat.
open webui steps
# In the Open WebUI sidebar:
1. Go to Workspace → Knowledge (labelled Documents in older releases)
2. Upload PDFs, DOCX, TXT, MD files
3. In chat, type # to reference a document
4. Or enable "RAG Mode" to always search docs

# Advanced: AnythingLLM for full local RAG
docker run -d -p 3001:3001 \
  -v anythingllm:/app/server/storage \
  --name anythingllm \
  mintplexlabs/anythingllm
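Under the hood, RAG is two steps: find the chunks most relevant to the question, then paste them into the prompt so the model can cite them. The toy sketch below shows that flow. It ranks chunks by word-overlap cosine similarity, which is a deliberate simplification and not how Open WebUI or AnythingLLM retrieve (they use an embedding model); the function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts, a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = tokenize(query)
    return sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Put the retrieved chunks in front of the question as numbered sources."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks)))
    return f"Answer using only the sources below and cite them by number.\n\n{context}\n\nQuestion: {query}"


docs = [
    "Ollama serves models on port 11434.",
    "Open WebUI is the chat interface on port 3000.",
    "QLoRA trains small adapters on a 4-bit base model.",
]
print(build_prompt("Which port does Ollama use?", docs))
```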
04
Optional: Add n8n for Agentic Workflows
When you want your AI to take actions – send emails, write files, search the web, call APIs – connect n8n to Ollama.
bash
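docker run -d -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  --add-host=host.docker.internal:host-gateway \
  --name n8n \
  n8nio/n8n

# Open n8n at http://localhost:5678
# Add Ollama node → base URL http://host.docker.internal:11434
# (inside the container, localhost is the container itself, not your machine)
# On Linux, Ollama may also need OLLAMA_HOST=0.0.0.0 to accept container connections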
Your Complete Stack
Your Chat Interface
Open WebUI – localhost:3000
Your Knowledge Base
Open WebUI RAG / AnythingLLM
Your Model Server
Ollama – localhost:11434
Your Model Files
GGUF quantized models · Local disk
Cost
€0 / month
All tools are free to self-host (n8n is fair-code rather than open source)
Data
100% Local
Nothing leaves your machine
⚠️ Windows users: Install Docker Desktop first and enable WSL2 integration. Run the Docker commands above from a WSL2 shell; in PowerShell, replace the trailing \ line continuations with backticks.
Sections 03–05 · Use Cases
Seven Playbooks. One Stack.

Same foundation every time: Ollama + Open WebUI. What changes is the model you choose, the documents you upload, and how you wire the tools together.

💬
Personal AI Assistant
Daily chat, summaries, Q&A, writing
Model
Phi 4 or Qwen3:14B
Min Tier
T1 (16GB RAM)
Extra
None
1. Install Ollama + pull phi4
2. Open WebUI → set system prompt: your name, role, preferences
3. Enable conversation memory in Settings → General
4. Create a saved prompt library for repeated tasks
👨‍💻
Coding Assistant
Code completion, review, debugging in VS Code
Model
Gemma 4 26B A4B or Phi 4
Min Tier
T1 (16GB, use Phi 4)
Extra
Continue.dev VS Code extension
1. Install VS Code extension: Continue.dev
2. In Continue config: set provider to ollama, model to phi4
3. Use Tab to accept inline completions, Ctrl+L for chat, Ctrl+I for inline edits (Cmd on macOS)
4. Add project docs to Continue context for codebase-aware help
🧠
Personal Knowledge Base
RAG over your own notes, docs, books
Model
Gemma 4 26B A4B or Qwen3:14B
Min Tier
T1 (16GB)
Extra
Open WebUI RAG (built-in)
1. Open WebUI → Workspace → Knowledge (labelled Documents in older releases) → Upload files
2. Settings → Documents → Set chunk size 500, overlap 50
3. In chat, type #filename to query a specific doc
4. Enable RAG mode for all chats to always search your docs
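The chunk size 500 and overlap 50 settings control how documents are sliced before indexing: each chunk repeats the tail of the previous one so a sentence split at a boundary still appears whole somewhere. A minimal sketch of that sliding window, assuming plain character counts (Open WebUI's own splitter is more sophisticated and may count tokens instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(1000))  # 1000 characters of dummy text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Smaller chunks make retrieval more precise but give the model less surrounding context per hit; larger chunks do the opposite.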
📄
Document Analysis
Summarise, extract, compare large documents
Model
Gemma 4 26B A4B (256K context)
Min Tier
T2 (32GB RAM)
Extra
Open WebUI built-in PDF reader
1. Pull gemma4:26b-a4b – 256K context handles full reports
2. Upload PDF directly in the chat window (paperclip icon)
3. Prompt: "Summarise section 3" / "Extract all dates and deadlines"
4. Compare two docs: upload both, ask for differences
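Before uploading a long report, it helps to sanity-check whether it can fit in the context window at all. The sketch below uses the common rule of thumb of about 4 characters per token for English text, which is an assumption and varies by tokenizer and language. Note also that Ollama's default context length is usually far below a model's maximum, so the window may need raising (for example via the num_ctx parameter) before a long document fits.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; 4 chars/token is a rule of thumb, not exact."""
    return int(len(text) / chars_per_token)


def fits_context(text: str, context_tokens: int = 256_000,
                 reserve_tokens: int = 4_000) -> bool:
    """Check the document fits, leaving room for the prompt and the answer."""
    return estimate_tokens(text) + reserve_tokens <= context_tokens


report = "word " * 50_000  # stand-in for a long report (250,000 characters)
print(estimate_tokens(report), fits_context(report))
```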
🖼️
Vision + Multi-modal
Image analysis, chart reading, screenshot Q&A
Model
Gemma 4 26B A4B or Mistral Small 3.1
Min Tier
T0 (Gemma 4 E2B for basic)
Extra
None – built into model
1. Pull a vision model: gemma4:26b-a4b or mistral-small3.1
2. In Open WebUI chat, click the image icon to attach a photo
3. Ask: "What does this architecture diagram show?" or "Read the text in this screenshot"
4. Use for: reading charts, UI screenshots, whiteboard photos, receipts
🔍
Research Assistant
Web search + local docs + synthesis
Model
Gemma 4 26B A4B
Min Tier
T2 (32GB)
Extra
Open WebUI + SearXNG (web)
1. Run SearXNG: docker run -d -p 8080:8080 searxng/searxng (enable the json output format in SearXNG's settings.yml so Open WebUI can read results)
2. Open WebUI → Settings → Web Search → enable, point to SearXNG
3. Toggle web search icon in chat for live web queries
4. Combine with RAG: web results + your own notes in one answer
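The same SearXNG instance can be queried from your own scripts, which is useful for checking that search works before wiring it into Open WebUI. A minimal sketch, assuming SearXNG runs on localhost:8080 as in step 1 and has the json format enabled in its settings; the helper names are invented for this example and the network call is left commented out.

```python
import json
import urllib.parse
import urllib.request


def searxng_url(query: str, base: str = "http://localhost:8080") -> str:
    """Build a SearXNG search URL that asks for JSON results."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base}/search?{params}"


def search(query: str, limit: int = 5) -> list[dict]:
    """Fetch results and keep only title and URL (needs a running SearXNG)."""
    with urllib.request.urlopen(searxng_url(query)) as resp:
        results = json.loads(resp.read()).get("results", [])
    return [{"title": r.get("title"), "url": r.get("url")} for r in results[:limit]]


print(searxng_url("local llm hardware"))
# With SearXNG running:
# for hit in search("local llm hardware"):
#     print(hit["title"], hit["url"])
```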
⚙️
Agentic Workflows
AI that takes actions across tools and APIs
Model
Gemma 4 26B A4B (native function calling)
Min Tier
T2 (32GB)
Extra
n8n or Open WebUI Pipelines
1. Start n8n: docker run -d -p 5678:5678 n8nio/n8n
2. Create workflow: Trigger → Ollama AI Agent → Action node
3. Add tools: file system, HTTP request, email, calendar
4. Test: "Summarise my latest emails and save to a file"
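What n8n does visually can be pictured as a small loop: the model replies with a tool call, your code runs the matching function, and the result goes back to the model as a new message. The sketch below shows only the dispatch step, with a hand-written fake response. The message shape is modelled on what Ollama's /api/chat returns for tool calls, which is an assumption worth checking against your Ollama version, and word_count is a hypothetical tool.

```python
import json


def word_count(text: str) -> int:
    """A deliberately harmless example tool."""
    return len(text.split())


TOOLS = {"word_count": word_count}  # tool name -> Python function

# Tool description in the JSON-schema shape sent under "tools" in the request
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count the words in a piece of text",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]


def run_tool_calls(message: dict) -> list[dict]:
    """Run each tool call in an assistant message and build 'tool' replies."""
    replies = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some servers send arguments as a JSON string
            args = json.loads(args)
        handler = TOOLS.get(fn["name"])
        result = handler(**args) if handler else f"unknown tool: {fn['name']}"
        replies.append({"role": "tool", "content": str(result)})
    return replies


fake_response = {"role": "assistant", "tool_calls": [
    {"function": {"name": "word_count", "arguments": {"text": "local models run offline"}}},
]}
print(run_tool_calls(fake_response))
```

Keeping an explicit allow-list like TOOLS means the model can only trigger functions you registered, which matters once tools can write files or send email.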
Use Case → Minimum Tier Reference
Use Case | Min Tier | Recommended Model | Speed on T2 | Extra Tool?
Personal Chat | T1 | Phi 4 | 4–6 t/s | None
Coding (VS Code) | T1 | Phi 4 / Gemma 4 E4B | 4–8 t/s | Continue.dev
Knowledge Base / RAG | T1 | Phi 4 | 3–5 t/s | Built-in WebUI
Document Analysis | T2 | Gemma 4 26B A4B | 3–5 t/s | None
Vision / Images | T0 | Gemma 4 E2B+ | 2–4 t/s | None
Research + Web | T2 | Gemma 4 26B A4B | 3–5 t/s | SearXNG
Agentic Workflows | T2 | Gemma 4 26B A4B | 3–5 t/s | n8n
Section 06 · Fine-tuning
Make the Model Sound Like You.

A general model knows everything broadly but nothing specifically about your domain, your writing style, or your audience. Fine-tuning teaches it. Here's the complete picture – tools, techniques, process, and what hardware you actually need.

Choose Your Technique
System Prompt
No GPU · Works now
Inject your persona, style, rules into every conversation. No training required. Limited to what fits in context.
RAG
T1+ · Works now
Teach by retrieval. Your docs become the model's memory at query time. No training, always up-to-date.
QLoRA ⭐ Recommended
8GB VRAM min (or free Colab)
Efficient fine-tuning. Loads the base model in 4-bit, keeps those weights frozen, and trains only small adapter matrices on top. 60% less memory than full fine-tune. Best ROI.
LoRA
8–16 GB VRAM
Same adapter approach as QLoRA, but the frozen base model stays in 16-bit instead of 4-bit. More memory, marginally better quality for the same dataset.
Full Fine-tuning
4× model size in VRAM
Updates every weight. Maximum flexibility, maximum hardware requirement. For serious domain adaptation at scale.
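To see why adapter methods are so much cheaper than full fine-tuning, count the trainable parameters. LoRA leaves a weight matrix of shape d_out × d_in frozen and learns two thin matrices of rank r beside it, adding r × (d_in + d_out) parameters. A short worked example, with a hidden size of 4096 chosen purely for illustration:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank)
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen matrix it sits beside."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


d = 4096  # hypothetical hidden size, for illustration only
for r in (8, 16, 64):
    print(f"rank {r}: {lora_params(d, d, r):,} params, "
          f"{trainable_fraction(d, d, r):.2%} of the full matrix")
```

Even at rank 64 the adapter is a small fraction of the matrix it modifies, which is why the optimizer state fits on consumer GPUs.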
The Fine-tuning Process
Step 01 · Collect Data
200–1000 examples min. Your writing, your Q&As, your domain content.
Step 02 · Format
Convert to JSONL. Use ChatML (a messages list) or Alpaca (instruction / input / output) format.
Step 03 · Choose Base
Pick your base model. 7B for speed, 14B for quality. Llama/Qwen/Gemma all work.
Step 04 · Configure
Set LoRA rank (8–64), alpha, learning rate (2e-4). Use Unsloth defaults first.
Step 05 · Train
On free Colab T4 (16GB) or local GPU. 1–3 hours for 7B model on 1000 examples.
Step 06 · Evaluate
Test on held-out examples. Check perplexity drop. Does it sound right?
Step 07 · Export
Merge the LoRA adapter into the base model, then export to GGUF format (Q4 or Q8).
Step 08 · Deploy
Load GGUF into Ollama. Serve via Open WebUI. Your model, running locally.
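The "perplexity drop" in the Evaluate step has a simple definition: perplexity is the exponential of the average negative log-probability the model assigns to each held-out token, so lower means the model finds your text less surprising. A minimal sketch, using made-up log-probabilities in place of real model output:

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token log-probs on the same held-out sentence
base_model = [-2.3, -1.9, -2.8, -2.1]
fine_tuned = [-1.2, -0.9, -1.5, -1.0]

print(f"base: {perplexity(base_model):.2f}  tuned: {perplexity(fine_tuned):.2f}")
```

A lower number on held-out data is a good sign, but it does not replace reading sample outputs: a model can score well and still sound wrong.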
Fine-tuning Tools
Unsloth ⭐
Start Here
2× faster training, 60% less memory. Supports Llama, Qwen, Gemma, Phi, Mistral. Free tier on Google Colab. Best for beginners.
LLaMA Factory
Web UI
GUI-based fine-tuning. No code required. Good for non-technical teams. Supports most popular model families.
Axolotl
Config-Driven
YAML config files. Very flexible. Supports many dataset formats and techniques. Best for custom training loops.
HF PEFT
Full Control
Hugging Face's Parameter-Efficient Fine-Tuning library. Maximum code control. Requires most Python knowledge.
Dataset Format (ChatML / JSONL)
jsonl – one example per line (expanded here for readability)
{
  "messages": [
    {"role": "system",    "content": "You are a senior cloud architect who explains complex systems to junior engineers in plain Urdu-influenced English."},
    {"role": "user",      "content": "Explain microservices in simple terms."},
    {"role": "assistant", "content": "Think of it like a dhaba kitchen..."}
  ]
}
# In the actual file, each example sits on a single line and comments are not allowed.
# Repeat this pattern for each training example (200+ recommended)
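Hand-editing JSONL is error-prone, so it is usually safer to build examples as Python dictionaries and let the json module write one object per line. The sketch below does that with a few basic sanity checks. The checks (allowed roles, assistant message last) are reasonable assumptions for chat-style fine-tuning, not requirements of any particular trainer.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> None:
    """Raise ValueError if an example is not a usable chat transcript."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' list")
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            raise ValueError("every message needs non-empty string content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be the assistant reply to learn from")


def to_jsonl(examples: list[dict]) -> str:
    """Serialize validated examples, one JSON object per line."""
    lines = []
    for ex in examples:
        validate_example(ex)
        lines.append(json.dumps(ex, ensure_ascii=False))
    return "\n".join(lines) + "\n"


examples = [{"messages": [
    {"role": "user", "content": "Explain microservices in simple terms."},
    {"role": "assistant", "content": "Think of it like a dhaba kitchen..."},
]}]
print(to_jsonl(examples), end="")
```

Write the returned string to a file such as train.jsonl (a placeholder name) and point your training script at it.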
python – Unsloth QLoRA quick start
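from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the base model in 4-bit (QLoRA) and attach LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-4",
    max_seq_length = 2048,
    load_in_4bit = True,   # QLoRA
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)

# Load your JSONL dataset ("train.jsonl" is a placeholder path) and render
# each "messages" list into a single training string with the chat template
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# Train. Argument names shift between trl versions, so check the current
# Unsloth notebook if this call is rejected.
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text", learning_rate=2e-4,
                   num_train_epochs=1, output_dir="outputs"),
)
trainer.train()

# Export to GGUF for Ollama
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q4_k_m")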
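To load the exported GGUF into Ollama you need a Modelfile that points at it. The sketch below renders one from Python. The GGUF path is a placeholder, since the exact filename Unsloth writes can vary, and the temperature default is an arbitrary choice for illustration.

```python
def render_modelfile(gguf_path: str, system_prompt: str = "",
                     temperature: float = 0.7) -> str:
    """Render a minimal Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",                      # which weights to load
        f"PARAMETER temperature {temperature}",   # default sampling temperature
    ]
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


# "./my_model/model.gguf" is a placeholder; use the file the export produced
modelfile = render_modelfile("./my_model/model.gguf",
                             system_prompt="You explain systems in plain language.")
print(modelfile, end="")
```

Save the output as a file named Modelfile, then register it with ollama create my-model -f Modelfile (my-model is any name you choose) and it will appear in ollama list and in Open WebUI.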
💡 No local GPU? Google Colab's free T4 (16GB VRAM) is enough for QLoRA on 7B–14B models. Kaggle offers a P100 (16GB) with 30 hours/week free. Both work with Unsloth out of the box.
Section 07 · Career Roadmap
What You Can Build. When You Can Build It.

Local AI is a leverage skill. Every stage below unlocks a new tier of things you can build, automate, or offer. The timeline is realistic – each stage takes weeks to months, not years.

Learn.
Month 1 – The Foundation
Week 1: Install Ollama, run your first model, have a conversation
Week 2: Set up Open WebUI, create a custom system prompt, explore 3 models
Week 3: Upload 10 documents, build your first RAG knowledge base
Week 4: Compare Phi 4 vs Gemma 4 vs Qwen3 on the same prompt set
Unlocked: Local chat assistant · Personal knowledge base · Basic document analysis · Offline AI with no API bills
Excel.
Months 2–3 – Applied Skills
Month 2: Set up Continue.dev in VS Code, code with local AI daily
Month 2: Build a RAG pipeline over a real codebase or research paper set
Month 3: First agentic workflow in n8n – automate a real task you do weekly
Month 3: Add web search to your research assistant via SearXNG
Unlocked: AI pair programmer · Codebase-aware assistant · Automated research pipeline · Agentic task automation
Lead.
Months 4–6 – Expert Practitioner
Month 4: Curate your first fine-tuning dataset (200+ examples from your writing)
Month 5: QLoRA fine-tune a 7B model on Colab, export to GGUF, load in Ollama
Month 5: Build a domain-specific chatbot and deploy it for a real user
Month 6: Run your own OpenAI-compatible local API endpoint for your team
Unlocked: Custom domain model · Team AI infrastructure · Fine-tuning consultancy · Local AI deployment for clients
What Month-6 Engineers Can Build and Offer
What You Can Build | Who Needs It | Why It Matters
Company Knowledge Chatbot | Any business with internal docs | Replace hours of searching with seconds of querying
Private Code Review Bot | Dev teams with compliance constraints | No proprietary code sent to OpenAI
Domain-tuned Writing Assistant | Law firms, consultancies, media | Model that speaks in their exact voice and terminology
Local Customer Support AI | SMBs that can't afford cloud AI at scale | Cost: €0/month. Quality: better than GPT-3.5 on their domain
Research Synthesis Pipeline | Researchers, analysts, journalists | Ingest 100 PDFs, extract structured insights automatically
Agentic Task Automation | Operations teams, founders | AI that monitors, decides, and acts – not just chats
The real leverage: Most engineers know how to use ChatGPT. Very few know how to run, customise, and deploy their own models. That gap is where expertise lives – and where consulting, tooling, and product opportunities sit. This skill set is not mainstream yet. It will be.
Open Source
Every Tool. Every Repo.

The whole stack is free and runs on your machine. Here is every project this series is built on, with its source. Star the ones you rely on; that is how open source stays alive.

Ollama
The model server. Pulls and runs local models behind an OpenAI-compatible API at localhost:11434.
github.com/ollama/ollama
Open WebUI
The chat interface. A self-hosted, offline ChatGPT-style front end with built-in RAG.
github.com/open-webui/open-webui
n8n
The agent layer. Fair-code workflow automation that wires your model to tools, APIs and triggers.
github.com/n8n-io/n8n
AnythingLLM
An all-in-one local knowledge base and agent app, an alternative to Open WebUI's built-in RAG.
github.com/Mintplex-Labs/anything-llm
SearXNG
A private metasearch engine that gives your model live, untracked web results.
github.com/searxng/searxng
Continue
The in-editor coding agent. Your local model as a pair programmer in VS Code or JetBrains.
github.com/continuedev/continue
Unsloth
Fast, low-memory fine-tuning. Train a model on your own data on a free GPU.
github.com/unslothai/unsloth
The models themselves are open weights. You pull them through Ollama's library at ollama.com/library, where Gemma, Qwen, Phi, Mistral, Llama and DeepSeek all live.