AI Infrastructure & Local LLMs | Module 1

1.2 Nvidia vs. Apple Silicon Benchmarking

30 min · Code examples · Practice Lab · Quiz (5 questions)

Hardware Benchmarking: Measuring Inference Performance

In local AI operations, raw clock speed is secondary to Tokens Per Second (TPS). Every rupee you save on API costs depends on how efficiently your hardware converts electricity into useful output. In this lesson, you will learn how to benchmark your local hardware systematically, interpret the results, and translate raw throughput into real business decisions — expressed in PKR, not dollars.

Benchmarking is not a one-time exercise. As you load larger models, swap quantization levels, or add parallel bots to your pipeline, your performance envelope shifts. An operator who benchmarks regularly understands their machine's true capacity; one who guesses will hit bottlenecks in production at the worst possible time.

The 3 Primary Metrics

Understanding these three numbers gives you a complete picture of your inference stack's health.

1. Time to First Token (TTFT) The latency between sending a prompt and receiving the first token in the response. TTFT is dominated by the prompt-processing phase (called the "prefill" stage), where the model reads your entire input in one pass. A 512-token prompt on a 7B model typically has a TTFT of 0.3-1.5 seconds. For voicebots and real-time assistants, TTFT must stay under 1 second. For batch-processing pipelines, TTFT matters less than raw TPS.

2. Tokens Per Second (TPS) The sustained throughput during generation. This is the number that determines how many emails you can draft, how many leads you can score, and how much batch work you can push through per hour. Human reading speed is 4-5 tokens/second. A usable chatbot needs 15+ TPS. Batch processing pipelines benefit from 50+ TPS. Anything below 10 TPS on a production system is a red flag.

3. VRAM Utilization How much of your GPU memory is consumed by model weights versus the active context window (KV cache). If VRAM utilization exceeds 90%, the runtime starts spilling weights and cache into system RAM, and your TPS collapses by 80-95%. A healthy deployment keeps VRAM utilization at 70-80%, leaving 20-30% of headroom for context spikes.
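
To spot-check this third metric while a model is loaded, you can parse nvidia-smi from Python. This is a minimal sketch assuming an Nvidia card with nvidia-smi on the PATH (Apple Silicon uses unified memory and reports it differently); the function name is illustrative.

python
import subprocess

def vram_utilization_pct() -> float:
    # Query used and total GPU memory in MB from nvidia-smi (first GPU only).
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    used_mb, total_mb = (float(x) for x in out.split(","))
    return 100 * used_mb / total_mb

pct = vram_utilization_pct()
print(f"VRAM utilization: {pct:.1f}%")
if pct > 90:
    print("Warning: likely spilling into system RAM -- expect TPS to collapse.")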

The 4th Hidden Metric: CPU-GPU Transfer Overhead

When your model weights exceed VRAM capacity, part of the model offloads to RAM. This is called "CPU offloading" and it causes silent performance degradation. A 13B model on a card with 6GB VRAM will show 8-10 TPS instead of the expected 25-30 TPS — because data is constantly shuttling across the PCIe bus.

code
PCIe Transfer Bottleneck Diagram

CPU RAM (64GB)              GPU VRAM (8GB)
┌───────────────────┐       ┌─────────────────┐
│  Layer 1-16       │──────>│  Layer 17-32    │
│  (offloaded)      │  PCIe │  (active)       │
│  ~7 GB            │  bus  │  ~5.2 GB used   │
│                   │<──────│                 │
└───────────────────┘       └─────────────────┘
         |
         | PCIe 4.0 x16: 32 GB/s (theoretical)
         | Practical: 15-20 GB/s
         |
         v
         Each offloaded layer transfer: +30-50ms latency
          16 offloaded layers × 40ms ≈ 0.64s added to every TTFT

The fix: use a model that fits entirely in VRAM. Never half-fit a model if you have a smaller, better-fitting alternative.
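
As a rough pre-download check, you can estimate whether a quantized model fits by multiplying parameter count by bytes per weight and adding a margin for the KV cache and runtime buffers. The constants below are ballpark assumptions based on typical GGUF file sizes, not exact figures.

python
# Rough VRAM fit check. Bytes-per-parameter values are approximations
# for common GGUF quantizations, not exact figures.
BYTES_PER_PARAM = {"q4_K_M": 0.60, "q8_0": 1.07, "fp16": 2.0}

def fits_in_vram(params_billions: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    # overhead_gb is an assumed margin for KV cache and runtime buffers
    # at modest context lengths.
    needed_gb = params_billions * BYTES_PER_PARAM[quant] + overhead_gb
    print(f"{params_billions}B @ {quant}: ~{needed_gb:.1f} GB needed, {vram_gb} GB available")
    return needed_gb <= vram_gb

fits_in_vram(8, "q4_K_M", 8)    # ~6.3 GB -> fits on an 8 GB card
fits_in_vram(13, "q4_K_M", 6)   # ~9.3 GB -> will offload and crawl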


Technical Snippet: TPS Calculation Logic

To measure performance in Ollama, enable verbose output before running any prompt:

bash
# In Ollama interactive session, enable verbose mode
/set verbose

# Send a generation task — 500 words is a standard benchmark
"Write a 500 word technical brief on RAG pipeline design."

# Ollama will output performance stats at the end:
# eval count:    512 tokens
# eval duration: 10.240s
# eval rate:     50.00 tokens/s   <-- this is your TPS
# prompt eval count:    47 tokens
# prompt eval duration: 0.312s
# prompt eval rate:     150.64 tokens/s  <-- this is prefill speed

# TPS formula:
# TPS = eval_count / eval_duration
# Example: 512 / 10.24 = 50 TPS

For LM Studio, TPS is displayed live in the bottom status bar during generation. No configuration needed — just watch the number during inference.

For Python-based benchmarking across multiple models:

python
import json
import time
import httpx

def benchmark_model(model_name: str, prompt: str) -> dict:
    start = time.time()
    first_token_time = None
    token_count = 0

    # Stream from the local Ollama API. timeout=None disables httpx's
    # default 5-second timeout so long generations are not cut off.
    with httpx.stream("POST", "http://localhost:11434/api/generate",
                      json={"model": model_name, "prompt": prompt},
                      timeout=None) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("response"):          # count only chunks that carry text
                if first_token_time is None:
                    first_token_time = time.time()
                token_count += 1               # each streamed chunk is roughly one token

    end = time.time()
    total_duration = end - start
    generation_duration = end - first_token_time

    return {
        "model": model_name,
        "ttft_seconds": round(first_token_time - start, 3),
        "tps": round(token_count / generation_duration, 1),
        "total_seconds": round(total_duration, 2),
    }

# Run benchmark
result = benchmark_model("llama3:8b-instruct-q4_K_M", "Write 300 words on supply chain AI.")
print(result)
# {'model': 'llama3:8b-instruct-q4_K_M', 'ttft_seconds': 0.421, 'tps': 43.7, 'total_seconds': 8.14}
Key Insight

Nuance: Thermal Throttling

Unlike standard gaming sessions (which last 30-90 minutes), LLM inference runs continuously for hours. This distinction matters critically for Pakistani laptop users.

When a GPU hits its thermal limit (85-95 degrees C), the driver reduces clock speeds by 20-50% to protect the hardware. This is called thermal throttling. The result: your TPS starts at 45, then silently drops to 22-28 after 15 minutes of continuous inference — without any error message.

code
Thermal Throttling Over Time (RTX 3060 Mobile, no cooling pad)

TPS
50 |*
45 |  * *
40 |       *  *
35 |              *
30 |                  * *
25 |                        * * * * * * *
20 |
   +---+---+---+---+---+---+---+---+---+---> Time (minutes)
   0   5   10  15  20  25  30  35  40  45

Throttle point: ~18 minutes on a laptop without cooling stand
Steady-state throttled TPS: 55-60% of peak TPS
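
You can also watch for throttling directly instead of inferring it from a TPS drop. The sketch below assumes an Nvidia GPU with nvidia-smi on the PATH and polls temperature and SM clock once per minute while your batch job runs in another terminal.

python
import subprocess
import time

def gpu_temp_and_clock() -> tuple[int, int]:
    # Returns (temperature in C, SM clock in MHz) for the first GPU.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    temp_c, clock_mhz = (int(x) for x in out.split(","))
    return temp_c, clock_mhz

for minute in range(45):
    temp_c, clock_mhz = gpu_temp_and_clock()
    print(f"min {minute:2d}: {temp_c} C, {clock_mhz} MHz")
    time.sleep(60)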

Mitigation strategies, ranked by cost:

Solution                            Cost (PKR)     TPS Gain      Notes
Laptop cooling stand + 2 fans       1,500-3,000    +15-25%       Best value for laptops
Repaste CPU/GPU thermal compound    500 (DIY)      +10-20%       One-time, lasts 2 years
Undervolting via MSI Afterburner    Free           +8-15%        Requires experimentation
External cooling pad with PWM       3,000-5,000    +20-30%       Recommended for 24/7 ops
Desktop build (vs. laptop)          Cost delta     No throttle   Permanent fix

GPU Benchmark Reference: Pakistan Market 2026

code
TPS Comparison: Llama 3 8B, Q4_K_M quantization

GPU              VRAM    TPS     PKR Price (2026)    Verdict
─────────────────────────────────────────────────────────────
RTX 3060         12 GB   43-48   55,000-65,000       Best value
RTX 4060         8 GB    52-58   70,000-80,000       Fast, VRAM tight
RTX 3090         24 GB   38-44   110,000-130,000     Run 70B models
RTX 4070         12 GB   65-75   120,000-140,000     Speed king <150K
M2 Mac Mini      16 GB   28-35   175,000-190,000     Silent 24/7
RTX 3060 Mobile  6 GB    22-28   (in laptop)         Tight at 8B Q4
GTX 1660 Ti      6 GB    12-16   30,000-40,000       Budget, 7B only
RTX 2080 Ti      11 GB   30-36   60,000-75,000       Used, good value

Key insight for Pakistan: The RTX 3090 used market is the best PKR-to-VRAM ratio. A 24 GB card at PKR 110,000 runs 70B models that would otherwise require a PKR 400,000+ setup. Check OLX Karachi and Lahore — many crypto miners upgraded to 4090s and are offloading 3090s.

Comparison Table: Inference Backends

Backend            TPS (7B Q4)   Concurrency       Setup Difficulty    Best For
Ollama             40-50         Sequential        Easy                Development, single bots
LM Studio          38-48         Sequential        Trivial (GUI)       Testing, model discovery
vLLM               55-80         10-50+ parallel   Moderate (Linux)    Production multi-bot
llama.cpp (CLI)    42-52         Sequential        Moderate            Custom builds
TGI (Docker)       50-65         10-20 parallel    Moderate (Docker)   Enterprise deployment

Practice Lab

Task 1 — Single Model Baseline. Load llama3:8b-instruct-q4_K_M in Ollama. Enable verbose mode (/set verbose). Generate exactly 500 words. Record TTFT, TPS, and VRAM usage from nvidia-smi. This is your baseline number — write it down.

Task 2 — Quantization Speed Trade-off. Load the same model at q8_0 (if VRAM allows). Run the same 500-word generation and record TPS. Calculate: by what percentage did TPS drop? How much did output quality improve (rate it subjectively on a 1-10 scale)? This is your personal "fidelity vs. speed" trade-off matrix.

Task 3 — Thermal Endurance Run. Run 10 consecutive 500-word generations without stopping. Record TPS for each run and plot the numbers in a simple table (a script sketch follows below). Identify the exact run where throttling begins. If runs 1-3 average 45 TPS and runs 7-10 average 28 TPS, you are throttling at a 38% performance loss — a real production problem.
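
The following is a minimal sketch for automating Task 3, reusing the benchmark_model helper from the technical snippet above; the prompt, model tag, and run count are just the lab parameters.

python
# Thermal endurance run: 10 consecutive generations, logging TPS per run.
# Reuses benchmark_model() from the Python snippet earlier in this lesson.
PROMPT = "Write a 500 word technical brief on RAG pipeline design."
MODEL = "llama3:8b-instruct-q4_K_M"

tps_per_run = []
for run in range(1, 11):
    result = benchmark_model(MODEL, PROMPT)
    tps_per_run.append(result["tps"])
    print(f"Run {run:2d}: {result['tps']} TPS")

cold = sum(tps_per_run[:3]) / 3                       # average of runs 1-3 (cold GPU)
hot = sum(tps_per_run[6:]) / len(tps_per_run[6:])     # average of runs 7-10 (hot GPU)
loss = 100 * (cold - hot) / cold
print(f"Throttling loss: {loss:.0f}% (cold {cold:.1f} TPS vs hot {hot:.1f} TPS)")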

Pakistan Case Study

Scenario: Hamid's Lahore Cold Email Agency

Hamid Raza runs a B2B cold email agency from Lahore's Gulberg district. He charges clients PKR 25,000/month for 500 personalized cold emails per month (roughly PKR 50 per email). He was using the Claude API to draft emails — at roughly PKR 3 per 1,000 tokens and ~400 tokens per email, that is PKR 1.2 per email, or PKR 600/month in API costs. Manageable.

But when he scaled to 3 clients (1,500 emails/month) and started adding subject line variants (3 per email = 4,500 total generations), his Claude API bill jumped to PKR 5,400/month — more than 7% of his PKR 75,000 revenue gone.

Hamid bought a used RTX 3090 from OLX Lahore for PKR 118,000. He benchmarked it:

  • Llama 3 70B Q4_K_M: 18 TPS
  • Llama 3 8B Q4_K_M: 43 TPS
  • Processing 4,500 generations at 43 TPS, 400 tokens each: ~11.6 hours of GPU time per month

His electricity cost for that monthly workload: roughly PKR 100 (at PKR 24/kWh, with the RTX 3090 drawing ~350 W, ~11.6 hours is about 4 kWh).

PKR 118,000 in hardware cost divided by roughly PKR 5,300 in monthly savings (API cost minus electricity) = about 22 months to break even.
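
A quick sanity check of this arithmetic in Python; every figure is taken from the scenario above.

python
# Break-even sanity check using the case-study figures.
generations_per_month = 4_500
tokens_per_generation = 400
tps = 43
api_cost_per_generation_pkr = 1.2     # PKR 3 per 1,000 tokens x 400 tokens
gpu_watts = 350
tariff_pkr_per_kwh = 24
hardware_cost_pkr = 118_000

gpu_hours = generations_per_month * tokens_per_generation / tps / 3600
electricity_pkr = gpu_watts / 1000 * gpu_hours * tariff_pkr_per_kwh
api_bill_pkr = generations_per_month * api_cost_per_generation_pkr
monthly_savings_pkr = api_bill_pkr - electricity_pkr

print(f"GPU time per month:    {gpu_hours:.1f} h")           # ~11.6 h
print(f"Electricity per month: PKR {electricity_pkr:.0f}")   # ~PKR 98
print(f"API bill replaced:     PKR {api_bill_pkr:.0f}")      # PKR 5,400
print(f"Break-even:            {hardware_cost_pkr / monthly_savings_pkr:.0f} months")  # ~22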

"Yaar, API pe itna paise barbad ho raha tha. Ab ek baar hardware khareed liya, baaki sab free." — Hamid Raza

He then offered "private AI email drafting" as a premium service to clients who wanted their lead data to never leave Pakistan — and charged a 40% premium for it. The hardware paid for itself in 14 months, not 22.

Key Takeaways

  • TPS is the primary production metric — not model size, not parameter count. Know your hardware's TPS before building any pipeline.
  • TTFT matters for real-time applications (chatbots, voicebots). For batch processing pipelines, TTFT is almost irrelevant.
  • Thermal throttling is a silent killer — a laptop running at 25 TPS after 20 minutes of continuous inference is functionally 44% slower than its benchmark number.
  • CPU offloading destroys TPS. If a model does not fit entirely in VRAM, use a smaller or more aggressively quantized model instead.
  • The RTX 3090 used market in Pakistan (OLX, Hafeez Centre) offers the best PKR-per-VRAM ratio for running 30B-70B models.
  • A cooling stand (PKR 2,000) can recover 15-25% of throttled TPS — it is the highest-ROI hardware purchase you can make.
  • Benchmark before building. Deploy a benchmark script before committing to any model for a production workflow.
  • The electricity cost of local inference (PKR 50-200/day for a 3090 at full load) is nearly always lower than equivalent API costs at production volume.
  • Always compare "words per PKR" across local vs. cloud — not just absolute cost, but cost per unit of output.
  • Run benchmarks at thermal steady state (after 20 minutes), not on a cold machine. Cold benchmarks are optimistic and will mislead your pipeline design.

Lesson Summary

Includes a hands-on practice lab, runnable code examples, and a 5-question knowledge check below.

Quiz: Hardware Benchmarking: Measuring Inference Performance

5 questions to test your understanding. Score 60% or higher to pass.