1.1 — GPU VRAM vs. System RAM: The Inference Engine
In local LLM deployment, your GPU's VRAM (Video RAM) is the primary bottleneck for both inference speed and context window size. In this lesson, we break down the hardware requirements for building a local "laptop server."
🏗️ The Memory Hierarchy
To run a model, the entire set of weights must reside in the fastest memory possible.
- VRAM (GPU): Roughly 10x the memory bandwidth of system RAM (an RTX 3060's GDDR6 delivers ~360 GB/s vs. ~50 GB/s for dual-channel DDR4). Essential for low-latency responses.
- Unified Memory (Apple Silicon): Shared between CPU/GPU. Allows for massive models (70B+) on a single chip.
- System RAM: The "Swap" space. If a model doesn't fit in VRAM, it spills over here, causing a 10x-50x speed drop.
Technical Snippet: VRAM Calculation for Quantized Models
A 7B parameter model at 4-bit quantization (Q4_K_M) requires approximately:
(7 Billion Parameters * 0.7 bytes per weight) + 1GB Context Buffer = ~6GB VRAM.
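The rule of thumb above is easy to turn into a small calculator. A minimal sketch in Python (the 0.7 bytes/weight and 1 GB context buffer are this lesson's approximations for Q4_K_M, not exact figures):

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_weight: float = 0.7,
                     context_buffer_gb: float = 1.0) -> float:
    """Rough VRAM estimate for a quantized model.

    bytes_per_weight=0.7 approximates Q4_K_M: ~4-5 bits per weight
    plus quantization scales and metadata overhead.
    """
    weights_gb = params_billions * bytes_per_weight  # billions of params * bytes each ~= GB
    return weights_gb + context_buffer_gb

# 7B at Q4_K_M
print(round(estimate_vram_gb(7), 1))  # → 5.9
```

By the same estimate, a 13B model at Q4_K_M lands around 10 GB, which is why 12 GB cards are a popular budget choice for that size class.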
Nuance: The Context Buffer
As your chat gets longer, the "KV Cache" grows. If you only have 8GB of VRAM, you can run a 7B model, but your context window will be limited to ~4k tokens before it overflows into slow system RAM.
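The KV cache's growth can be estimated directly from the model's architecture: two tensors (K and V) per layer, each sized heads x head-dim per token. A minimal sketch, using the published Llama-3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, one slot per token.

    bytes_per_elem=2 assumes an fp16 cache; a quantized cache shrinks this.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 1024**3

# Llama-3-8B-style config, fp16 cache, 4k tokens of context
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # → 0.5
```

Note how the size scales linearly with context length: the same model at 32k tokens needs 8x the cache, which is exactly the growth that eventually spills out of VRAM.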
Practice Lab: Hardware Benchmarking
- Identify: Open Task Manager (Windows) or Activity Monitor (Mac) and find your "Dedicated Video Memory."
- Benchmark: Download LM Studio and load a Llama-3-8B-Q4_K_M model.
- Analyze: Run a long prompt and monitor the "Tokens Per Second" (TPS). Note when the speed drops as the context window fills.
📺 Recommended Videos & Resources
- Ollama Official Documentation — Complete setup guides and model library
  - Type: Documentation / Official Site
  - Link description: Browse the Ollama model library and installation guides
- LM Studio Model Hub — GGUF model discovery and testing
  - Type: Tool / GUI Application
  - Link description: Download the LM Studio desktop app for visual quantization-level testing
- VRAM Optimization for Local LLMs — Performance tuning guides
  - Type: YouTube
  - Link description: Search YouTube for "VRAM optimization local LLM inference 2024"
- RTX GPU Comparison for AI (Pakistan) — Market-specific pricing on OLX
  - Type: Local Market / Pakistan
  - Link description: Check OLX Karachi and Lahore sections for used GPU availability and pricing
- KV Cache Quantization Explained — Technical deep-dive on memory management
  - Type: GitHub Documentation
  - Link description: Visit the llama.cpp repository for advanced quantization details
🎯 Mini-Challenge
Challenge: Find the exact VRAM requirement for running a 7B model at Q4_K_M quantization with an 8k context window on your machine. Use LM Studio or Ollama's verbose mode to measure actual usage. Screenshot your results showing the exact VRAM occupied.
Time: 5 minutes (installation may take longer)
🖼️ Visual Reference
📊 Memory Hierarchy in Local Inference
┌─────────────────────────────────────────────────────────┐
│ System RAM (Large but Slow, Overflow Destination)       │
│  ┌───────────────────────────────────────────────────┐  │
│  │ GPU VRAM (~10x Faster, Model Weights Live Here)   │  │
│  │  ┌─────────────────────────────────────────────┐  │  │
│  │  │ KV Cache (Context Window — Grows with Chat) │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  │                                                   │  │
│  │ Model Weights: 7B @ Q4      = ~5.5 GB             │  │
│  │ KV Cache: 8k tokens @ Q4    = ~2.5 GB             │  │
│  │ Total VRAM needed           = ~8 GB               │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│ ⚠️ Overflow Point: when the cache outgrows free VRAM,   │
│ inference spills to system RAM (10-50x slower)          │
└─────────────────────────────────────────────────────────┘
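The overflow point in the diagram can be estimated in a few lines: subtract the weights from total VRAM, then divide the headroom by the cache cost per 1k tokens. A rough sketch using the diagram's own approximate numbers:

```python
def max_context_tokens(vram_gb: float, weights_gb: float,
                       cache_gb_per_1k_tokens: float) -> int:
    """Tokens of context that fit before the KV cache spills to system RAM."""
    headroom_gb = vram_gb - weights_gb
    if headroom_gb <= 0:
        return 0  # the weights alone don't fit in VRAM
    return int(headroom_gb / cache_gb_per_1k_tokens * 1000)

# Diagram numbers: 8 GB card, ~5.5 GB of Q4 weights,
# ~2.5 GB per 8k tokens of cache (~0.3125 GB per 1k)
print(max_context_tokens(8.0, 5.5, 2.5 / 8))  # → 8000
```

In practice the OS and display compositor also reserve some VRAM, so treat the result as an upper bound rather than a guarantee.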
Homework: The PKR Economics of Compute
Calculate the cost of an RTX 3060 (12GB VRAM) vs. an RTX 4060 (8GB VRAM) in the local Pakistani market. Which card is better for running a private "Lead Scoring" bot 24/7? Justify your choice based on Context Window size.
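For the electricity half of the comparison, here is a sketch of the monthly running cost. The 170 W figure is the RTX 3060's rated TDP; the PKR 50/kWh tariff is a placeholder you should replace with your actual rate, and card prices must come from the local market:

```python
def running_cost_pkr_per_month(card_watts: float,
                               tariff_pkr_per_kwh: float,
                               hours_per_day: float = 24,
                               days: int = 30) -> float:
    """Electricity cost of running a GPU continuously for one month."""
    kwh = card_watts / 1000 * hours_per_day * days  # watts -> kWh consumed
    return kwh * tariff_pkr_per_kwh

# RTX 3060 at its 170 W TDP, hypothetical PKR 50/kWh tariff
print(round(running_cost_pkr_per_month(170, 50)))  # → 6120
```

A 24/7 bot rarely runs at full TDP, so measure idle vs. load draw before drawing conclusions; the bigger factor for a lead-scoring workload is usually whether the 8 GB card's context window is large enough at all.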
Lesson Summary
Quiz: GPU VRAM vs. System RAM: The Inference Engine
5 questions to test your understanding. Score 60% or higher to pass.