1.1 — GPU VRAM vs. System RAM
GPU VRAM vs. System RAM: The Inference Engine
In local LLM deployment, your GPU's VRAM (Video RAM) is the single most important hardware constraint. Every model you run, every context window you open, every concurrent request you serve — all of it lives and dies by VRAM. In this lesson we break down the full memory hierarchy, the math behind VRAM calculations, and how to make smart hardware decisions in the Pakistani market where GPU prices run 30-50% above US retail.
The Memory Hierarchy
When a model generates tokens, it needs to access its weights on every forward pass. The speed at which it can read those weights determines your Tokens Per Second (TPS). Here is the full hierarchy from fastest to slowest:
MEMORY HIERARCHY FOR LLM INFERENCE
====================================
Tier 1 — GPU VRAM (Fastest)
├── Speed: 900 GB/s - 3.35 TB/s (RTX 3090 to H100)
├── Latency: ~100 nanoseconds
├── Capacity: 6 GB (RTX 3050) to 80 GB (A100)
├── Role: Model weights + KV cache MUST live here
└── Cost (PK): RTX 3060 12GB = PKR 65,000-85,000
Tier 2 — Unified Memory (Apple Silicon)
├── Speed: 100-400 GB/s (M2 Ultra: 800 GB/s)
├── Latency: ~150 nanoseconds
├── Capacity: 8 GB to 192 GB (M2 Ultra)
├── Role: Shared CPU+GPU pool — run 70B on Mac Studio
└── Cost (PK): M2 Mac Mini 16GB = PKR 175,000-200,000
Tier 3 — System RAM (Fallback / Slow)
├── Speed: 50-100 GB/s (DDR5)
├── Latency: ~70 nanoseconds (but VRAM bandwidth gap kills TPS)
├── Capacity: 8 GB to 128 GB
├── Role: Overflow when model > VRAM. 10-50x speed penalty.
└── Use case: CPU-only inference (last resort)
Tier 4 — NVMe SSD (Emergency Swap)
├── Speed: 5-7 GB/s (PCIe 4.0)
├── Latency: ~100 microseconds
├── Role: Model layers streamed from disk (llama.cpp --mmap)
└── Verdict: Extremely slow. Only for experimentation.
The Overflow Penalty in Numbers
When a model spills from VRAM to system RAM on an RTX 3060 paired with DDR5 RAM:
| Memory Region | Bandwidth | TPS (Llama 3 8B Q4) | Penalty |
|---|---|---|---|
| Full VRAM (12 GB) | 360 GB/s | 38 TPS | Baseline |
| 8 GB VRAM + 4 GB RAM | 360 / 50 GB/s mixed | 12 TPS | 3x slower |
| Full System RAM | 50 GB/s | 4 TPS | 10x slower |
| NVMe streaming | 5 GB/s | 0.4 TPS | 95x slower |
The takeaway: a model that does not fully fit in VRAM is not just slightly slower — it is unusable for production automation.
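That verdict follows directly from memory bandwidth: generating one token requires reading roughly the entire set of weights once, so a hard ceiling on TPS is bandwidth divided by model size. Here is a minimal Python sketch of that ceiling, using the illustrative bandwidth and size figures from the tables above:

```python
def tps_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on tokens per second: each token reads ~all weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 8B at Q4_K_M is ~4.4 GB of weights (see the VRAM section below)
model_gb = 4.4
for label, bw in [("RTX 3060 VRAM", 360), ("DDR5 system RAM", 50), ("NVMe PCIe 4.0", 6)]:
    print(f"{label:17s} -> at most ~{tps_ceiling(bw, model_gb):5.1f} TPS")
# Real numbers land below these ceilings (compute, prompt processing, cache effects),
# but the order-of-magnitude gap between tiers is exactly what the table shows.
```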
VRAM Calculation: The Exact Formula
VRAM CALCULATION FOR GGUF MODELS
==================================
Formula:
VRAM_needed = (Parameters × bits_per_weight / 8) + KV_cache_size + overhead (≈ 0.5 GB)
Where:
KV_cache_size = (2 × n_layers × n_heads × head_dim × context_tokens × bytes_per_element)
Simplified approximation:
KV_cache ≈ 0.5 GB per 1k context tokens (≈ 0.5 MB per token, for an 8B model at Q4)
WORKED EXAMPLES:
Model: Llama 3 8B at Q4_K_M
├── Weights: 8B × 0.55 bytes/param = 4.4 GB
├── KV cache at 2k context: 2,048 tokens × 0.5 GB/1k ≈ 1.0 GB
├── KV cache at 8k context: 8,192 tokens × 0.5 GB/1k ≈ 4.0 GB
├── Overhead (activations, buffers): 0.5 GB
│
├── 2k context total: 4.4 + 1.0 + 0.5 = 5.9 GB → fits RTX 3050 8GB
└── 8k context total: 4.4 + 4.0 + 0.5 = 8.9 GB → needs RTX 3060 12GB
Model: Llama 3 70B at Q4_K_M
├── Weights: 70B × 0.55 bytes/param = 38.5 GB
├── KV cache at 4k context: 4,096 tokens × 2.5 GB/1k ≈ 10.2 GB
├── Total: ~49 GB → needs RTX 3090 × 2 or A100 80GB
Model: Phi-3 Mini 3.8B at Q4_K_M
├── Weights: 3.8B × 0.55 bytes/param = 2.1 GB
├── KV cache at 4k context: ~2.0 GB
├── Total: ~4.5 GB → fits GTX 1660 Ti 6GB
└── Verdict: Best model for budget Pakistani laptops
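The same arithmetic is easy to script, so you can check any model, quantization, and context combination before downloading a multi-gigabyte file. Below is a minimal Python sketch of the formula above; the kv_gb_per_1k default of 0.5 is the 8B-class rule of thumb, not an exact per-model constant:

```python
def vram_needed_gb(params_b: float, bytes_per_weight: float, context_tokens: int,
                   kv_gb_per_1k: float = 0.5, overhead_gb: float = 0.5) -> float:
    """Approximate VRAM for a GGUF model: weights + KV cache + fixed overhead."""
    weights_gb = params_b * bytes_per_weight            # e.g. 8B x 0.55 bytes/param at Q4_K_M
    kv_gb = (context_tokens / 1024) * kv_gb_per_1k      # rule-of-thumb KV cache growth
    return weights_gb + kv_gb + overhead_gb

# Reproduce the worked examples above
print(vram_needed_gb(8, 0.55, 2_048))                        # ~5.9 GB  (Llama 3 8B, 2k ctx)
print(vram_needed_gb(8, 0.55, 8_192))                        # ~8.9 GB  (Llama 3 8B, 8k ctx)
print(vram_needed_gb(70, 0.55, 4_096, kv_gb_per_1k=2.5))     # ~49 GB   (Llama 3 70B, 4k ctx)
```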
Quantization Impact on VRAM
Different GGUF quantization formats yield different model sizes for the same parameter count:
| Quantization | Bits/Weight | Llama 3 8B Size | VRAM for 4k ctx | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 16.0 GB | 22 GB | None (reference) |
| Q8_0 | 8 | 8.0 GB | 12 GB | Negligible |
| Q6_K | 6 | 6.1 GB | 10 GB | < 0.5% |
| Q5_K_M | 5 | 5.3 GB | 9 GB | < 1% |
| Q4_K_M | 4 | 4.4 GB | 8 GB | 1-2% (sweet spot) |
| Q3_K_M | 3 | 3.5 GB | 7 GB | 3-5% |
| Q2_K | 2 | 2.8 GB | 6 GB | 8-12% |
The K_M suffix indicates "K-quant medium": these are llama.cpp's improved quantization variants that use different bit widths for different weight matrices (the most quality-sensitive tensors, such as the attention value and feed-forward down projections, keep more bits than the rest). They outperform the plain Qx_0 variants at the same nominal bit count.
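The weight-size column above is just parameters × bits ÷ 8, so you can generate it for any parameter count. A short sketch follows; note that real GGUF files run slightly larger than this ideal figure because each quantization block also stores scale metadata:

```python
QUANTS = {"FP16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

def weight_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Ideal weight size in GB: parameters x bits / 8 (ignores block-scale overhead)."""
    return params_b * bits_per_weight / 8

for name, bits in QUANTS.items():
    print(f"{name:7s} ~{weight_size_gb(8, bits):5.1f} GB of weights for an 8B model")
```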
KV Cache Growth: The Context Tax
The KV cache is the memory structure that stores the model's "attention states" — essentially the model's working memory for the current conversation. Unlike model weights (which are fixed), the KV cache grows with every token you generate.
KV CACHE GROWTH VISUALIZATION
================================
RTX 3060 (12 GB VRAM) running Llama 3 8B Q4:
Available VRAM: 12.0 GB
Model weights: - 4.4 GB
---------
Free for KV cache: 7.6 GB
Context tokens: KV cache size: TPS estimate: Status:
1,024 tokens 0.5 GB 42 TPS Excellent
2,048 tokens 1.0 GB 38 TPS Good
4,096 tokens 2.0 GB 32 TPS Good
8,192 tokens 4.0 GB 22 TPS Acceptable
16,384 tokens 8.0 GB OVERFLOW ← spills to RAM
32,768 tokens 16.0 GB OVERFLOW ← unusable
Rule: Never fill more than 80% of available VRAM with KV cache.
Safety limit on 12 GB card: 7.6 GB × 0.8 = ~6 GB for KV = ~12k tokens max
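The 80% rule turns into a one-line check for any card: subtract the weights from total VRAM, apply the safety factor, and divide by the per-1k-token KV cost. A minimal sketch, using the same 8B-at-Q4 rule of thumb as above:

```python
def max_safe_context(vram_gb: float, weights_gb: float,
                     kv_gb_per_1k: float = 0.5, safety: float = 0.8) -> int:
    """Largest context (in tokens) that keeps the KV cache inside the safety margin."""
    free_gb = vram_gb - weights_gb          # VRAM left after loading the weights
    kv_budget_gb = free_gb * safety         # never fill more than ~80% with KV cache
    return int(kv_budget_gb / kv_gb_per_1k * 1024)

print(max_safe_context(12.0, 4.4))   # ~12k tokens on an RTX 3060 with Llama 3 8B Q4
print(max_safe_context(8.0, 4.4))    # ~6k tokens on an 8 GB card
```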
Pakistan GPU Market Reality
The Pakistani hardware market has specific dynamics that affect your purchasing decisions:
GPU BUYING GUIDE — PAKISTAN 2026
==================================
Budget Tier (PKR 30,000-65,000):
├── GTX 1660 Ti / RTX 2060 (6 GB VRAM)
│ ├── From: OLX Karachi, OLX Lahore, Hafeez Centre Lahore
│ ├── Models: Phi-3 Mini, Qwen 2.5 1.5B, TinyLlama
│ └── Use case: Lead scoring, keyword extraction, classification
Mid-Range Tier (PKR 65,000-100,000):
├── RTX 3060 12 GB (BEST VALUE in Pakistan)
│ ├── From: Hafeez Centre Lahore, Saddar Karachi (new)
│ │ OLX (used — check mining wear)
│ ├── Models: Llama 3 8B Q4, Mistral 7B, Qwen 2.5 7B
│ └── Use case: Full agency automation stack
├── RTX 4060 8 GB
│ ├── Price premium: 20% over 3060 for less VRAM
│ ├── Pro: Better power efficiency (Ada architecture)
│ └── Verdict: Worse value than 3060 for LLM work
Pro Tier (PKR 130,000-220,000):
├── RTX 3090 24 GB (OLX used — ex-crypto miners)
│ ├── Check: GPU-Z artifact test before buying
│ ├── Models: Llama 3 70B Q4, Mixtral 8x7B
│ └── Use case: Production agency server
├── M2 Mac Mini 16 GB Unified
│ ├── Silent, 10W idle vs 350W for RTX 3090
│ ├── Best for 24/7 "always on" inference
│ └── Use case: Client-facing API servers
Electricity Cost Comparison (24/7 operation):
├── RTX 3090: 350W × 720h × PKR 35/kWh = PKR 8,820/month
├── RTX 3060: 170W × 720h × PKR 35/kWh = PKR 4,284/month
└── M2 Mac Mini: 20W × 720h × PKR 35/kWh = PKR 504/month
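The electricity figures come from the simple formula watts × hours × tariff, so you can substitute your own average draw and your actual per-unit rate; the PKR 35/kWh used here is the assumption from the comparison above:

```python
def monthly_electricity_pkr(watts: float, pkr_per_kwh: float = 35, hours: float = 720) -> float:
    """Monthly electricity cost for 24/7 operation at a given average power draw."""
    kwh = watts / 1000 * hours
    return kwh * pkr_per_kwh

for name, watts in [("RTX 3090", 350), ("RTX 3060", 170), ("M2 Mac Mini", 20)]:
    print(f"{name:12s} PKR {monthly_electricity_pkr(watts):,.0f}/month")
```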
Practice Lab
Exercise 1: Identify Your Hardware Baseline
Open Task Manager (Windows) or run nvidia-smi in terminal. Record your exact GPU model, VRAM capacity, and current memory clock speed. Use the formula from this lesson to calculate the maximum model size you can run at Q4_K_M with a 4k context window. Write it down — this is your hardware ceiling.
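If you want to script this baseline check on an NVIDIA card, nvidia-smi can report the same numbers in machine-readable form. A minimal sketch using its standard query flags:

```python
import subprocess

# Ask nvidia-smi for the GPU name and total VRAM as plain CSV (no header row).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # e.g. "NVIDIA GeForce RTX 3060, 12288 MiB"
# The current memory clock is visible in the default `nvidia-smi` table output.
```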
Exercise 2: VRAM Calculation Worksheet
For each of the following scenarios, calculate the VRAM needed:
- Phi-3 Mini (3.8B) at Q4_K_M, 8k context window
- Mistral 7B (7B) at Q5_K_M, 4k context window
- Llama 3 8B (8B) at Q8_0, 2k context window
Compare your answers against the actual VRAM usage reported by LM Studio after loading each model.
Exercise 3: The OLX Research Mission
Visit olx.com.pk and search for used GPUs in your city (Karachi, Lahore, or Islamabad). Find the best PKR-per-GB-of-VRAM ratio currently available. Compare: RTX 3060 12GB vs RTX 3090 24GB vs RTX 4060 8GB. Which offers the best value for local LLM inference right now? Write a 3-sentence justification.
Pakistan Case Study
Scenario: Bilal Hussain, a software engineer in Karachi's PECHS neighborhood, wants to build a private lead scoring bot for his digital agency. He processes 500 restaurant leads per day from Google Maps data.
His hardware budget: PKR 75,000 (one-time purchase)
His analysis:
- Lead scoring input: business name + website + Google rating = ~200 tokens per lead
- Required context window: 512 tokens (plenty of headroom)
- Required TPS: 500 leads × 200 output tokens = 100,000 tokens per day. Spread over a full 8-hour day that is only ~3.5 TPS, but he wants the daily batch finished within about an hour, which works out to roughly 28 TPS minimum (see the sketch below)
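A quick sanity check of that throughput target in Python (the one-hour batch window is Bilal's own working constraint, assumed here):

```python
leads_per_day = 500
output_tokens_per_lead = 200
batch_window_seconds = 60 * 60        # finish the daily batch within about an hour

total_tokens = leads_per_day * output_tokens_per_lead      # 100,000 tokens per day
required_tps = total_tokens / batch_window_seconds         # ~28 TPS minimum
minutes_at_38_tps = total_tokens / 38 / 60                 # ~44 minutes of pure generation
print(f"Need ~{required_tps:.0f} TPS; at 38 TPS generation alone takes ~{minutes_at_38_tps:.0f} min")
```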
His options evaluated:
| Option | GPU | VRAM | PKR Cost | TPS at Q4 | Monthly Elec. | Verdict |
|---|---|---|---|---|---|---|
| A | GTX 1660 Ti 6GB | 6 GB | PKR 38,000 | 20 TPS | PKR 3,200 | Too slow |
| B | RTX 3060 12GB (new) | 12 GB | PKR 78,000 | 38 TPS | PKR 4,300 | Over budget |
| C | RTX 3060 12GB (used OLX) | 12 GB | PKR 58,000 | 38 TPS | PKR 4,300 | Best option |
| D | RTX 3090 24GB (used OLX) | 24 GB | PKR 145,000 | 65 TPS | PKR 8,800 | Overkill |
His decision: Used RTX 3060 from OLX Lahore, shipped to Karachi for PKR 58,000 plus PKR 1,500 courier. He runs Llama 3 8B Q4_K_M at 512-token context. Processes his 500 leads in under 90 minutes at 38 TPS.
His ROI calculation: Previous Claude API cost for 500 leads/day at ~400 tokens each = 200,000 tokens × PKR 0.0003 = PKR 60/day = PKR 1,800/month. Hardware pays back in 33 months on API savings alone. But he also pitches "100% private — data never leaves Pakistan" to enterprise clients, which lets him charge PKR 15,000/month instead of PKR 5,000/month for the same service.
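Written out as arithmetic, and one way to reconcile the 33-month figure with the 8 months Bilal quotes below: add the premium pricing and subtract electricity. All figures are the ones used in this case study, not current market rates:

```python
hardware_pkr = 58_000 + 1_500                   # used RTX 3060 plus courier
api_savings = 200_000 * 0.0003 * 30             # ~PKR 1,800/month saved vs the Claude API
premium_uplift = 15_000 - 5_000                 # extra monthly fee from the "private AI" pitch
electricity = 4_284                             # RTX 3060 running 24/7 (electricity table above)

print(f"API savings alone: {hardware_pkr / api_savings:.0f} months to pay back")
net_monthly_gain = api_savings + premium_uplift - electricity
print(f"Including premium pricing, net of electricity: {hardware_pkr / net_monthly_gain:.0f} months")
```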
Bilal's feedback: "At first I thought a GPU was too expensive. But when I calculated it, I realised the ROI comes within 8 months. And pitching 'private AI' to clients is very easy; they place a lot of value on security."
Key Takeaways
- VRAM is the hard bottleneck for LLM inference — model weights plus KV cache must fit entirely in GPU VRAM for acceptable performance
- Overflow to system RAM causes a 10-50x speed penalty, making the model effectively unusable for production automation
- The VRAM formula: (Parameters × 0.55 bytes at Q4) + (context tokens × ~0.5 GB per 1k tokens, for an 8B model) + 0.5 GB overhead
- Q4_K_M is the sweet spot: 75% VRAM savings vs FP16, only 1-2% quality loss — default choice for Pakistani hardware
- KV cache grows linearly with context length — never exceed 80% of available VRAM or TPS degrades sharply
- The RTX 3060 12GB (PKR 58,000-85,000) is the best value GPU in Pakistan for LLM work in 2026
- A used RTX 3090 from ex-crypto miners on OLX offers the best PKR-per-GB-VRAM ratio if your budget allows
- Apple M2 Mac Mini (16 GB unified) is the best choice for silent 24/7 servers due to 10W idle power draw
- Always check mining wear on used GPUs (GPU-Z stress test + artifact check before purchase)
- "Data never leaves Pakistan" is a premium pitch — enterprise clients pay 3-5x more for on-premise AI solutions