1.2 — Nvidia vs. Mac Silicon Benchmarking
Hardware Benchmarking: Measuring Inference Performance
In local AI operations, raw clock speed is secondary to Tokens Per Second (TPS). In this lesson, we learn how to benchmark your local hardware to determine its capacity for industrial-scale automation.
🏗️ The 3 Primary Metrics
- Time to First Token (TTFT): The latency between command and initial response. Critical for real-time voicebots.
- Tokens Per Second (TPS): The raw throughput. 5-10 TPS is human-readable; 50+ TPS is required for high-volume data processing.
- VRAM Utilization: How much of your GPU memory is consumed by the model weights vs. the context window.
Technical Snippet: TPS Calculation Logic
To measure performance in Ollama or LM Studio:
# In Ollama, use the /verbose flag
/set verbose
"Write a 500 word technical brief on RAG."
# Check the output for:
# eval count: 512 tokens
# eval duration: 10.2s
# TPS = eval count / eval duration = ~50 TPS
Nuance: Thermal Throttling
Unlike standard gaming, LLM inference is highly intensive for long durations. If your "Laptop Server" hits 90°C, your TPS will drop by 50%. A professional setup requires active cooling (laptop stands or server racks).
Practice Lab: The Multi-Model Benchmark
- Load: Load
Llama-3-8Bat Q4 quantization. Record the TPS. - Scale: Load the same model at Q8 (higher fidelity). Record the TPS drop.
- Analyze: Determine the "Fidelity vs. Speed" trade-off for your specific hardware.
🇵🇰 Pakistan Hardware Reality
In Pakistan, GPU prices are inflated 30-50% compared to US retail. Here's the real landscape:
| GPU | VRAM | PKR Price (2026) | Best For |
|---|---|---|---|
| RTX 3060 | 12GB | PKR 55,000-65,000 | Best value — runs 7B-13B models |
| RTX 4060 | 8GB | PKR 70,000 | Good TPS but limited by 8GB |
| RTX 3090 | 24GB | PKR 120,000 (used) | Runs 30B+ models comfortably |
| M2 Mac Mini | 16GB unified | PKR 180,000 | Silent, low power, great for 24/7 |
Pro Tip: Check OLX Karachi and Lahore for used RTX 3090s from crypto miners. Many are selling at 50% off retail. A used 3090 is the best PKR-to-VRAM ratio in Pakistan right now.
Electricity Math: Running a 3090 24/7 costs ~PKR 3,000-4,000/month in electricity. Compare that to PKR 15,000+/month for Claude API at scale. Local inference pays for itself in 2-3 months.
📺 Recommended Videos & Resources
-
Tokens Per Second (TPS) Benchmarking Guide — Real-world inference speed testing
- Type: YouTube
- Link description: Search for "tokens per second benchmark LLM 2024"
-
GPU Thermal Throttling Prevention — Nvidia thermal management guides
- Type: Documentation / NVIDIA Support
- Link description: Check NVIDIA support site for cooling and thermal management
-
Pakistani Hardware Market Analysis — Hafeez Centre (Lahore) GPU pricing
- Type: Pakistan Market / Electronics
- Link description: Browse Hafeez Centre Lahore for current GPU stock and pricing
-
OLX GPU Used Market (Pakistan) — Real-time pricing for used GPUs
- Type: Local Market / Pakistan
- Link description: Check OLX Electronics section for used RTX 3090s and gaming GPUs
-
LM Studio Performance Monitoring — Real-time TPS display and benchmarking
- Type: Tool / Application
- Link description: LM Studio shows live TPS and VRAM usage during inference
🎯 Mini-Challenge
Challenge: Load a 7B model in Ollama. Generate a 500-word response and record the TPS. Then load a 70B model (or skip if VRAM insufficient) and generate the same 500 words. Calculate the TPS ratio and determine which model gives better "words per dollar" value for your Pakistani budget.
Time: 5 minutes (after model download)
🖼️ Visual Reference
📊 TPS vs. Model Complexity
┌────────────────────────────────────────────────────┐
│ Your RTX 3090 Hardware Throughput: │
│ │
│ Phi-3 (3.8B) at Q4 ████████████████ 80 TPS │
│ Llama-3 (8B) at Q4 ██████████ 45 TPS │
│ Llama-3 (70B) at Q4 ████ 15 TPS │
│ │
│ Rule: Doubling model size ≈ 1/3 the TPS │
│ │
│ 🇵🇰 PKR Impact: │
│ 100k words at 80 TPS = 1,250 seconds = 21 min │
│ 100k words at 15 TPS = 6,667 seconds = 111 min │
│ │
│ Choosing speed vs. quality = business decision │
└────────────────────────────────────────────────────┘
Homework: The TPS Target
Identify a task that requires processing 100,000 words. Based on your current TPS, calculate how many hours it would take to finish. Propose one hardware upgrade available in the Pakistani market to cut that time in half.
Lesson Summary
Quiz: Hardware Benchmarking: Measuring Inference Performance
5 questions to test your understanding. Score 60% or higher to pass.