Module 1: The Silicon Layer

1.2 Nvidia vs. Mac Silicon Benchmarking

30 min · 2 code blocks · Practice Lab · Homework · Quiz (5Q)

Hardware Benchmarking: Measuring Inference Performance

In local AI operations, raw clock speed matters less than Tokens Per Second (TPS). In this lesson, you will learn to benchmark your local hardware and determine its capacity for industrial-scale automation.

🏗️ The 3 Primary Metrics

  1. Time to First Token (TTFT): The latency between command and initial response. Critical for real-time voicebots.
  2. Tokens Per Second (TPS): The raw throughput. 5-10 TPS matches human reading speed; 50+ TPS is required for high-volume data processing.
  3. VRAM Utilization: How much of your GPU memory is consumed by the model weights vs. the context window.

Technical Snippet: TPS Calculation Logic

To measure performance in Ollama or LM Studio:

```bash
# Inside an interactive `ollama run <model>` session, turn on timing stats:
/set verbose

# Then send a prompt:
"Write a 500 word technical brief on RAG."

# The verbose footer reports, e.g.:
#   eval count:    512 token(s)
#   eval duration: 10.2s

# TPS = eval count / eval duration = 512 / 10.2 ≈ 50 TPS
```
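The same numbers can be read programmatically: Ollama's REST API (`POST /api/generate`) returns `eval_count` and `eval_duration` (in nanoseconds) alongside the generated text. A minimal sketch, assuming a default Ollama install listening on localhost:11434:

```python
import json
import urllib.request

def tps_from_stats(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

def measure_tps(model: str, prompt: str,
                host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation against a local Ollama server and
    compute TPS from the timing fields in its JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    return tps_from_stats(stats["eval_count"], stats["eval_duration"])
```

Run the same prompt two or three times and average: the first call includes model load time, so later runs reflect steady-state throughput.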

Nuance: Thermal Throttling

Unlike gaming workloads, which are bursty, LLM inference holds the GPU at sustained high load for minutes at a time. If your "laptop server" hits 90°C, thermal throttling can cut your TPS by 50% or more. A professional setup requires active cooling (a laptop cooling stand or a server rack).
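On NVIDIA hardware you can watch for this directly: `nvidia-smi` exposes the die temperature, so a short watchdog can warn you before throttling kicks in. A sketch, assuming the `nvidia-smi` CLI is on your PATH (the 90°C threshold mirrors the figure above):

```python
import subprocess

THROTTLE_WARN_C = 90  # temperature at which the lesson expects TPS to fall off

def parse_temperature(raw: str) -> int:
    """Parse the plain-number output of nvidia-smi's csv,noheader format."""
    return int(raw.strip().splitlines()[0])

def gpu_temperature_c() -> int:
    """Query the first GPU's current temperature (NVIDIA only)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temperature(out)

def throttle_risk(temp_c: int) -> bool:
    """True when the GPU is at or past the throttling danger zone."""
    return temp_c >= THROTTLE_WARN_C
```

Poll this in a loop during a long benchmark; if temperatures climb steadily toward the threshold, your cooling, not your silicon, is the bottleneck.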


Practice Lab: The Multi-Model Benchmark

  1. Load: Load Llama-3-8B at Q4 quantization. Record the TPS.
  2. Scale: Load the same model at Q8 (higher fidelity). Record the TPS drop.
  3. Analyze: Determine the "Fidelity vs. Speed" trade-off for your specific hardware.
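One way to tabulate the lab's results is a small helper (hypothetical, not part of Ollama or LM Studio) that ranks each recorded run against the fastest configuration:

```python
def summarize_benchmark(runs: dict[str, float]) -> str:
    """Given {config_label: measured_tps}, report each configuration's
    throughput relative to the fastest run, fastest first."""
    fastest = max(runs.values())
    lines = []
    for label, tps in sorted(runs.items(), key=lambda kv: -kv[1]):
        lines.append(f"{label:<16} {tps:6.1f} TPS  ({tps / fastest:5.1%} of fastest)")
    return "\n".join(lines)

# Example with made-up numbers; substitute your own measurements:
print(summarize_benchmark({"llama3-8b-q4": 45.0, "llama3-8b-q8": 30.0}))
```

If Q8 retains most of your Q4 throughput, the fidelity upgrade is nearly free on your hardware; if it halves it, you have found your trade-off.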

🇵🇰 Pakistan Hardware Reality

In Pakistan, GPU prices are inflated 30-50% compared to US retail. Here's the real landscape:

| GPU | VRAM | PKR Price (2026) | Best For |
| --- | --- | --- | --- |
| RTX 3060 | 12GB | PKR 55,000-65,000 | Best value — runs 7B-13B models |
| RTX 4060 | 8GB | PKR 70,000 | Good TPS but limited by 8GB |
| RTX 3090 | 24GB | PKR 120,000 (used) | Runs 30B+ models comfortably |
| M2 Mac Mini | 16GB unified | PKR 180,000 | Silent, low power, great for 24/7 |

Pro Tip: Check OLX Karachi and Lahore for used RTX 3090s from crypto miners. Many are selling at 50% off retail. A used 3090 is the best PKR-to-VRAM ratio in Pakistan right now.

Electricity Math: Running a 3090 24/7 costs ~PKR 3,000-4,000/month in electricity. Compare that to PKR 15,000+/month for Claude API at scale. Local inference pays for itself in 2-3 months.
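The claim above can be sanity-checked with basic arithmetic. The wattage (350 W sustained, a plausible figure for a 3090 under inference load) and the tariff (PKR 15/kWh) below are illustrative assumptions chosen to land in the lesson's PKR 3,000-4,000 range; substitute your actual meter rate:

```python
def monthly_power_cost_pkr(watts: float, tariff_pkr_per_kwh: float,
                           hours_per_day: float = 24.0) -> float:
    """Monthly electricity cost for a GPU drawing `watts` continuously."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * tariff_pkr_per_kwh

# Assumed: 350 W sustained draw, PKR 15/kWh tariff, 24/7 operation
print(monthly_power_cost_pkr(350, 15))
```

In practice the card idles far below peak between requests, so this is an upper bound on the power bill.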

📺 Recommended Videos & Resources

  • Tokens Per Second (TPS) Benchmarking Guide — Real-world inference speed testing

    • Type: YouTube
    • Link description: Search for "tokens per second benchmark LLM 2024"
  • GPU Thermal Throttling Prevention — Nvidia thermal management guides

    • Type: Documentation / NVIDIA Support
    • Link description: Check NVIDIA support site for cooling and thermal management
  • Pakistani Hardware Market Analysis — Hafeez Centre (Lahore) GPU pricing

    • Type: Pakistan Market / Electronics
    • Link description: Browse Hafeez Centre Lahore for current GPU stock and pricing
  • OLX GPU Used Market (Pakistan) — Real-time pricing for used GPUs

    • Type: Local Market / Pakistan
    • Link description: Check OLX Electronics section for used RTX 3090s and gaming GPUs
  • LM Studio Performance Monitoring — Real-time TPS display and benchmarking

    • Type: Tool / Application
    • Link description: LM Studio shows live TPS and VRAM usage during inference

🎯 Mini-Challenge

Challenge: Load a 7B model in Ollama. Generate a 500-word response and record the TPS. Then load a 70B model (or skip this step if your VRAM is insufficient) and generate the same 500 words. Calculate the TPS ratio and determine which model gives better "words per rupee" value for your Pakistani budget.

Time: 5 minutes (after model download)

🖼️ Visual Reference

```
📊 TPS vs. Model Complexity
┌────────────────────────────────────────────────────┐
│ Your RTX 3090 Hardware Throughput:                 │
│                                                    │
│ Phi-3 (3.8B) at Q4      ████████████████  80 TPS   │
│ Llama-3 (8B) at Q4      ██████████        45 TPS   │
│ Llama-3 (70B) at Q4     ████              15 TPS   │
│                                                    │
│ Rule of thumb: ~9x the parameters ≈ 1/3 the TPS    │
│                                                    │
│ 🇵🇰 PKR Impact:                                    │
│ 100k words at 80 TPS = 1,250 seconds = ~21 min     │
│ 100k words at 15 TPS = 6,667 seconds = ~111 min    │
│                                                    │
│ Choosing speed vs. quality is a business decision  │
└────────────────────────────────────────────────────┘
```
Homework

Homework: The TPS Target

Identify a task that requires processing 100,000 words. Based on your current TPS, calculate how many hours it would take to finish. Propose one hardware upgrade available in the Pakistani market to cut that time in half.
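For the time estimate, the chart above treats one word as roughly one token; a quick calculator under that assumption (English prose is often closer to ~1.3 tokens per word, so adjust `tokens_per_word` for a more conservative figure):

```python
def hours_for_words(word_count: int, tps: float,
                    tokens_per_word: float = 1.0) -> float:
    """Wall-clock hours to generate `word_count` words at a given TPS,
    assuming `tokens_per_word` tokens per word of output."""
    return word_count * tokens_per_word / tps / 3600

# 100,000 words at 15 TPS, using the chart's 1 word ≈ 1 token assumption
print(round(hours_for_words(100_000, 15), 2))
```

Plug in your measured TPS before and after a proposed upgrade to show the time saved in your homework write-up.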

Lesson Summary

  • Includes hands-on practice lab
  • Homework assignment included
  • 2 runnable code examples
  • 5-question knowledge check below

Quiz: Hardware Benchmarking: Measuring Inference Performance

5 questions to test your understanding. Score 60% or higher to pass.