1.1 — GPU VRAM vs. System RAM
GPU VRAM vs. System RAM: The Inference Engine
In local LLM deployment, your GPU's VRAM (Video RAM) is the single most important hardware constraint. Every model you run, every context window you open, every concurrent request you serve — all of it lives and dies by VRAM. In this lesson we break down the full memory hierarchy, the math behind VRAM calculations, and how to make smart hardware decisions in the Pakistani market where GPU prices run 30-50% above US retail.
The Memory Hierarchy
When a model generates tokens, it needs to access its weights on every forward pass. The speed at which it can read those weights determines your Tokens Per Second (TPS). Here is the full hierarchy from fastest to slowest:
MEMORY HIERARCHY FOR LLM INFERENCE
====================================
Tier 1 — GPU VRAM (Fastest)
├── Speed: 900 GB/s - 3.35 TB/s (RTX 3090 to H100)
├── Latency: ~100 nanoseconds
├── Capacity: 6 GB (RTX 3050) to 80 GB (A100)
├── Role: Model weights + KV cache MUST live here
└── Cost (PK): RTX 3060 12GB = PKR 65,000-85,000
Tier 2 — Unified Memory (Apple Silicon)
├── Speed: 100-400 GB/s (M2 Ultra: 800 GB/s)
├── Latency: ~150 nanoseconds
├── Capacity: 8 GB to 192 GB (M2 Ultra)
├── Role: Shared CPU+GPU pool — run 70B on Mac Studio
└── Cost (PK): M2 Mac Mini 16GB = PKR 175,000-200,000
Tier 3 — System RAM (Fallback / Slow)
├── Speed: 50-100 GB/s (DDR5)
├── Latency: ~70 nanoseconds (but VRAM bandwidth gap kills TPS)
├── Capacity: 8 GB to 128 GB
├── Role: Overflow when model > VRAM. 10-50x speed penalty.
└── Use case: CPU-only inference (last resort)
Tier 4 — NVMe SSD (Emergency Swap)
├── Speed: 5-7 GB/s (PCIe 4.0)
├── Latency: ~100 microseconds
├── Role: Model layers streamed from disk (llama.cpp --mmap)
└── Verdict: Extremely slow. Only for experimentation.
The Overflow Penalty in Numbers
When a model spills from VRAM to system RAM on an RTX 3060 paired with DDR5 RAM:
| Memory Region | Bandwidth | TPS (Llama 3 8B Q4) | Penalty |
|---|---|---|---|
| Full VRAM (12 GB) | 360 GB/s | 38 TPS | Baseline |
| 8 GB VRAM + 4 GB RAM | 360 / 50 GB/s mixed | 12 TPS | 3x slower |
| Full System RAM | 50 GB/s | 4 TPS | 10x slower |
| NVMe streaming | 5 GB/s | 0.4 TPS | 95x slower |
The takeaway: a model that does not fully fit in VRAM is not just slightly slower — it is unusable for production automation.
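That verdict follows directly from memory bandwidth: generating one token requires reading roughly the entire set of weights once, so a hard ceiling on TPS is bandwidth divided by model size. Here is a minimal Python sketch of that ceiling, using the illustrative bandwidth and size figures from the tables above:

```python
def tps_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on tokens per second: each token reads ~all weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 8B at Q4_K_M is ~4.4 GB of weights (see the VRAM section below)
model_gb = 4.4
for label, bw in [("RTX 3060 VRAM", 360), ("DDR5 system RAM", 50), ("NVMe PCIe 4.0", 6)]:
    print(f"{label:17s} -> at most ~{tps_ceiling(bw, model_gb):5.1f} TPS")
# Real numbers land below these ceilings (compute, prompt processing, cache effects),
# but the order-of-magnitude gap between tiers is exactly what the table shows.
```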
VRAM Calculation: The Exact Formula
VRAM CALCULATION FOR GGUF MODELS
==================================
Formula:
VRAM_needed = (Parameters × bits_per_weight / 8) + KV_cache_size + overhead (≈ 0.5 GB)
Where:
KV_cache_size = (2 × n_layers × n_heads × head_dim × context_tokens × bytes_per_element)
Simplified approximation:
KV_cache ≈ 0.5 GB per 1k context tokens (≈ 0.5 MB per token, for an 8B model at Q4)
WORKED EXAMPLES:
Model: Llama 3 8B at Q4_K_M
├── Weights: 8B × 0.55 bytes/param = 4.4 GB
├── KV cache at 2k context: 2,048 tokens × 0.5 GB/1k ≈ 1.0 GB
├── KV cache at 8k context: 8,192 tokens × 0.5 GB/1k ≈ 4.0 GB
├── Overhead (activations, buffers): 0.5 GB
│
├── 2k context total: 4.4 + 1.0 + 0.5 = 5.9 GB → fits RTX 3050 8GB
└── 8k context total: 4.4 + 4.0 + 0.5 = 8.9 GB → needs RTX 3060 12GB
Model: Llama 3 70B at Q4_K_M
├── Weights: 70B × 0.55 bytes/param = 38.5 GB
├── KV cache at 4k context: 4,096 tokens × 2.5 GB/1k ≈ 10.2 GB
├── Total: ~49 GB → needs RTX 3090 × 2 or A100 80GB
Model: Phi-3 Mini 3.8B at Q4_K_M
├── Weights: 3.8B × 0.55 bytes/param = 2.1 GB
├── KV cache at 4k context: ~2.0 GB
├── Total: ~4.5 GB → fits GTX 1660 Ti 6GB
└── Verdict: Best model for budget Pakistani laptops
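The same arithmetic is easy to script, so you can check any model, quantization, and context combination before downloading a multi-gigabyte file. Below is a minimal Python sketch of the formula above; the kv_gb_per_1k default of 0.5 is the 8B-class rule of thumb, not an exact per-model constant:

```python
def vram_needed_gb(params_b: float, bytes_per_weight: float, context_tokens: int,
                   kv_gb_per_1k: float = 0.5, overhead_gb: float = 0.5) -> float:
    """Approximate VRAM for a GGUF model: weights + KV cache + fixed overhead."""
    weights_gb = params_b * bytes_per_weight            # e.g. 8B x 0.55 bytes/param at Q4_K_M
    kv_gb = (context_tokens / 1024) * kv_gb_per_1k      # rule-of-thumb KV cache growth
    return weights_gb + kv_gb + overhead_gb

# Reproduce the worked examples above
print(vram_needed_gb(8, 0.55, 2_048))                        # ~5.9 GB  (Llama 3 8B, 2k ctx)
print(vram_needed_gb(8, 0.55, 8_192))                        # ~8.9 GB  (Llama 3 8B, 8k ctx)
print(vram_needed_gb(70, 0.55, 4_096, kv_gb_per_1k=2.5))     # ~49 GB   (Llama 3 70B, 4k ctx)
```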
Quantization Impact on VRAM
Different GGUF quantization formats yield different model sizes for the same parameter count:
| Quantization | Bits/Weight | Llama 3 8B Size | VRAM for 4k ctx | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 16.0 GB | 22 GB | None (reference) |
| Q8_0 | 8 | 8.0 GB | 12 GB | Negligible |
| Q6_K | 6 | 6.1 GB | 10 GB | < 0.5% |
| Q5_K_M | 5 | 5.3 GB | 9 GB | < 1% |
| Q4_K_M | 4 | 4.4 GB | 8 GB | 1-2% (sweet spot) |
| Q3_K_M | 3 | 3.5 GB | 7 GB | 3-5% |
| Q2_K | 2 | 2.8 GB | 6 GB | 8-12% |
The K_M suffix indicates "K-quant medium": these are llama.cpp's improved quantization variants that use different bit widths for different weight matrices (the most quality-sensitive tensors, such as the attention value and feed-forward down projections, keep more bits than the rest). They outperform the plain Qx_0 variants at the same nominal bit count.
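The weight-size column above is just parameters × bits ÷ 8, so you can generate it for any parameter count. A short sketch follows; note that real GGUF files run slightly larger than this ideal figure because each quantization block also stores scale metadata:

```python
QUANTS = {"FP16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

def weight_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Ideal weight size in GB: parameters x bits / 8 (ignores block-scale overhead)."""
    return params_b * bits_per_weight / 8

for name, bits in QUANTS.items():
    print(f"{name:7s} ~{weight_size_gb(8, bits):5.1f} GB of weights for an 8B model")
```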
KV Cache Growth: The Context Tax
The KV cache is the memory structure that stores the model's "attention states" — essentially the model's working memory for the current conversation. Unlike model weights (which are fixed), the KV cache grows with every token you generate.
KV CACHE GROWTH VISUALIZATION
================================
RTX 3060 (12 GB VRAM) running Llama 3 8B Q4:
Available VRAM: 12.0 GB
Model weights: - 4.4 GB
---------
Free for KV cache: 7.6 GB
Context tokens: KV cache size: TPS estimate: Status:
1,024 tokens 0.5 GB 42 TPS Excellent
2,048 tokens 1.0 GB 38 TPS Good
4,096 tokens 2.0 GB 32 TPS Good
8,192 tokens 4.0 GB 22 TPS Acceptable
16,384 tokens 8.0 GB OVERFLOW ← spills to RAM
32,768 tokens 16.0 GB OVERFLOW ← unusable
Rule: Never fill more than 80% of available VRAM with KV cache.
Safety limit on 12 GB card: 7.6 GB × 0.8 = ~6 GB for KV = ~12k tokens max
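The 80% rule turns into a one-line check for any card: subtract the weights from total VRAM, apply the safety factor, and divide by the per-1k-token KV cost. A minimal sketch, using the same 8B-at-Q4 rule of thumb as above:

```python
def max_safe_context(vram_gb: float, weights_gb: float,
                     kv_gb_per_1k: float = 0.5, safety: float = 0.8) -> int:
    """Largest context (in tokens) that keeps the KV cache inside the safety margin."""
    free_gb = vram_gb - weights_gb          # VRAM left after loading the weights
    kv_budget_gb = free_gb * safety         # never fill more than ~80% with KV cache
    return int(kv_budget_gb / kv_gb_per_1k * 1024)

print(max_safe_context(12.0, 4.4))   # ~12k tokens on an RTX 3060 with Llama 3 8B Q4
print(max_safe_context(8.0, 4.4))    # ~6k tokens on an 8 GB card
```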
Pakistan GPU Market Reality
The Pakistani hardware market has specific dynamics that affect your purchasing decisions:
GPU BUYING GUIDE — PAKISTAN 2026
==================================
Budget Tier (PKR 30,000-65,000):
├── GTX 1660 Ti / RTX 2060 (6 GB VRAM)
│ ├── From: OLX Karachi, OLX Lahore, Hafeez Centre Lahore
│ ├── Models: Phi-3 Mini, Qwen 2.5 1.5B, TinyLlama
│ └── Use case: Lead scoring, keyword extraction, classification
Mid-Range Tier (PKR 65,000-100,000):
├── RTX 3060 12 GB (BEST VALUE in Pakistan)
│ ├── From: Hafeez Centre Lahore, Saddar Karachi (new)
│ │ OLX (used — check mining wear)
│ ├── Models: Llama 3 8B Q4, Mistral 7B, Qwen 2.5 7B
│ └── Use case: Full agency automation stack
├── RTX 4060 8 GB
│ ├── Price premium: 20% over 3060 for less VRAM
│ ├── Pro: Better power efficiency (Ada architecture)
│ └── Verdict: Worse value than 3060 for LLM work
Pro Tier (PKR 130,000-220,000):
├── RTX 3090 24 GB (OLX used — ex-crypto miners)
│ ├── Check: GPU-Z artifact test before buying
│ ├── Models: Llama 3 70B Q4, Mixtral 8x7B
│ └── Use case: Production agency server
├── M2 Mac Mini 16 GB Unified
│ ├── Silent, 10W idle vs 350W for RTX 3090
│ ├── Best for 24/7 "always on" inference
│ └── Use case: Client-facing API servers
Electricity Cost Comparison (24/7 operation):
├── RTX 3090: 350W × 720h × PKR 35/kWh = PKR 8,820/month
├── RTX 3060: 170W × 720h × PKR 35/kWh = PKR 4,284/month
└── M2 Mac Mini: 20W × 720h × PKR 35/kWh = PKR 504/month
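The electricity figures come from the simple formula watts × hours × tariff, so you can substitute your own average draw and your actual per-unit rate; the PKR 35/kWh used here is the assumption from the comparison above:

```python
def monthly_electricity_pkr(watts: float, pkr_per_kwh: float = 35, hours: float = 720) -> float:
    """Monthly electricity cost for 24/7 operation at a given average power draw."""
    kwh = watts / 1000 * hours
    return kwh * pkr_per_kwh

for name, watts in [("RTX 3090", 350), ("RTX 3060", 170), ("M2 Mac Mini", 20)]:
    print(f"{name:12s} PKR {monthly_electricity_pkr(watts):,.0f}/month")
```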
Practice Lab
Exercise 1: Identify Your Hardware Baseline
Open Task Manager (Windows) or run nvidia-smi in terminal. Record your exact GPU model, VRAM capacity, and current memory clock speed. Use the formula from this lesson to calculate the maximum model size you can run at Q4_K_M with a 4k context window. Write it down — this is your hardware ceiling.
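If you want to script this baseline check on an NVIDIA card, nvidia-smi can report the same numbers in machine-readable form. A minimal sketch using its standard query flags:

```python
import subprocess

# Ask nvidia-smi for the GPU name and total VRAM as plain CSV (no header row).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # e.g. "NVIDIA GeForce RTX 3060, 12288 MiB"
# The current memory clock is visible in the default `nvidia-smi` table output.
```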
Exercise 2: VRAM Calculation Worksheet
For each of the following scenarios, calculate the VRAM needed:
- Phi-3 Mini (3.8B) at Q4_K_M, 8k context window
- Mistral 7B (7B) at Q5_K_M, 4k context window
- Llama 3 8B (8B) at Q8_0, 2k context window
Compare your answers against the actual VRAM usage reported by LM Studio after loading each model.
Exercise 3: The OLX Research Mission
Visit olx.com.pk and search for used GPUs in your city (Karachi, Lahore, or Islamabad). Find the best PKR-per-GB-of-VRAM ratio currently available. Compare: RTX 3060 12GB vs RTX 3090 24GB vs RTX 4060 8GB. Which offers the best value for local LLM inference right now? Write a 3-sentence justification.
Pakistan Case Study
Scenario: Bilal Hussain, a software engineer in Karachi's PECHS neighborhood, wants to build a private lead scoring bot for his digital agency. He processes 500 restaurant leads per day from Google Maps data.
His hardware budget: PKR 75,000 (one-time purchase)
His analysis:
- Lead scoring input: business name + website + Google rating = ~200 tokens per lead
- Required context window: 512 tokens (plenty of headroom)
- Required TPS: 500 leads × 200 output tokens = 100,000 tokens per day. Spread over a full 8-hour day that is only ~3.5 TPS, but he wants the daily batch finished within about an hour, which works out to roughly 28 TPS minimum (see the sketch below)
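A quick sanity check of that throughput target in Python (the one-hour batch window is Bilal's own working constraint, assumed here):

```python
leads_per_day = 500
output_tokens_per_lead = 200
batch_window_seconds = 60 * 60        # finish the daily batch within about an hour

total_tokens = leads_per_day * output_tokens_per_lead      # 100,000 tokens per day
required_tps = total_tokens / batch_window_seconds         # ~28 TPS minimum
minutes_at_38_tps = total_tokens / 38 / 60                 # ~44 minutes of pure generation
print(f"Need ~{required_tps:.0f} TPS; at 38 TPS generation alone takes ~{minutes_at_38_tps:.0f} min")
```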
His options evaluated:
| Option | GPU | VRAM | PKR Cost | TPS at Q4 | Monthly Elec. | Verdict |
|---|---|---|---|---|---|---|
| A | GTX 1660 Ti 6GB | 6 GB | PKR 38,000 | 20 TPS | PKR 3,200 | Too slow |
| B | RTX 3060 12GB (new) | 12 GB | PKR 78,000 | 38 TPS | PKR 4,300 | Over budget |
| C | RTX 3060 12GB (used OLX) | 12 GB | PKR 58,000 | 38 TPS | PKR 4,300 | Best option |
| D | RTX 3090 24GB (used OLX) | 24 GB | PKR 145,000 | 65 TPS | PKR 8,800 | Overkill |
His decision: Used RTX 3060 from OLX Lahore, shipped to Karachi for PKR 58,000 plus PKR 1,500 courier. He runs Llama 3 8B Q4_K_M at 512-token context. Processes his 500 leads in under 90 minutes at 38 TPS.
His ROI calculation: Previous Claude API cost for 500 leads/day at ~400 tokens each = 200,000 tokens × PKR 0.0003 = PKR 60/day = PKR 1,800/month. Hardware pays back in 33 months on API savings alone. But he also pitches "100% private — data never leaves Pakistan" to enterprise clients, which lets him charge PKR 15,000/month instead of PKR 5,000/month for the same service.
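Written out as arithmetic, and one way to reconcile the 33-month figure with the 8 months Bilal quotes below: add the premium pricing and subtract electricity. All figures are the ones used in this case study, not current market rates:

```python
hardware_pkr = 58_000 + 1_500                   # used RTX 3060 plus courier
api_savings = 200_000 * 0.0003 * 30             # ~PKR 1,800/month saved vs the Claude API
premium_uplift = 15_000 - 5_000                 # extra monthly fee from the "private AI" pitch
electricity = 4_284                             # RTX 3060 running 24/7 (electricity table above)

print(f"API savings alone: {hardware_pkr / api_savings:.0f} months to pay back")
net_monthly_gain = api_savings + premium_uplift - electricity
print(f"Including premium pricing, net of electricity: {hardware_pkr / net_monthly_gain:.0f} months")
```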
Bilal's feedback: "At first I thought a GPU was too expensive. But when I calculated it, I realised the ROI comes within 8 months. And pitching 'private AI' to clients is very easy; they place a lot of value on security."
Key Takeaways
- VRAM is the hard bottleneck for LLM inference — model weights plus KV cache must fit entirely in GPU VRAM for acceptable performance
- Overflow to system RAM causes a 10-50x speed penalty, making the model effectively unusable for production automation
- The VRAM formula: (Parameters × 0.55 bytes at Q4) + (context tokens × ~0.5 GB per 1k tokens, for an 8B model) + 0.5 GB overhead
- Q4_K_M is the sweet spot: 75% VRAM savings vs FP16, only 1-2% quality loss — default choice for Pakistani hardware
- KV cache grows linearly with context length — never exceed 80% of available VRAM or TPS degrades sharply
- The RTX 3060 12GB (PKR 58,000-85,000) is the best value GPU in Pakistan for LLM work in 2026
- A used RTX 3090 from ex-crypto miners on OLX offers the best PKR-per-GB-VRAM ratio if your budget allows
- Apple M2 Mac Mini (16 GB unified) is the best choice for silent 24/7 servers due to 10W idle power draw
- Always check mining wear on used GPUs (GPU-Z stress test + artifact check before purchase)
- "Data never leaves Pakistan" is a premium pitch — enterprise clients pay 3-5x more for on-premise AI solutions