Module 5: AI Infrastructure & Local LLMs

5.1 vLLM & TGI — Production Serving Frameworks

30 min · 4 code blocks · Practice Lab · Quiz (4Q)

vLLM & TGI — Production Serving Frameworks

Running a model locally on your laptop with Ollama is great for development. But when your Pakistani startup needs to serve 500 WhatsApp messages per second, or your agency has five enterprise clients hammering the same endpoint simultaneously, you need a production serving framework. vLLM and Text Generation Inference (TGI) are the two dominant open-source solutions — and understanding the difference will determine whether your LLM service stays online or crashes under load.

The Core Problem: Naive Inference Is Slow

When you naively call model.generate() in PyTorch, each request is processed one at a time. The GPU sits idle between requests. Memory is not efficiently shared between concurrent users. At even modest traffic levels — 10 concurrent users — latency spikes to 30-60 seconds per request. That's unusable.
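The arithmetic behind that spike is simple. Here is a back-of-envelope sketch in Python; the per-request time and the batching overhead factor are illustrative assumptions, not measurements:

```python
# Why sequential inference collapses under load: requests queue one behind
# another, so worst-case wait grows linearly with the number of users.
# The 4 s per-request figure and 1.3x batch overhead are made-up examples.

def sequential_latency(num_users, seconds_per_request):
    """Worst-case wait when requests are processed strictly one at a time."""
    return num_users * seconds_per_request

def batched_latency(num_users, seconds_per_request, batch_overhead=1.3):
    """Approximate wait when requests share one GPU batch.

    batch_overhead models the modest per-token slowdown of a larger batch.
    """
    return seconds_per_request * batch_overhead

if __name__ == "__main__":
    # 10 concurrent users, ~4 s per request in isolation
    print(f"sequential: {sequential_latency(10, 4.0):.0f} s")  # the 30-60 s regime
    print(f"batched:    {batched_latency(10, 4.0):.1f} s")
```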

Production serving frameworks solve this through two key techniques: continuous batching and PagedAttention.

Continuous Batching

Traditional batching waits for a batch of requests to arrive before processing them together. Continuous batching processes requests as they arrive, dynamically adding new requests to the in-flight batch as soon as a slot opens. This keeps GPU utilization above 90% even with variable request rates — vs. the 40-60% you get with naive inference.
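The idea can be shown with a toy scheduler. This simulation counts decode steps only and ignores real GPU behavior; it is a sketch of the scheduling policy, not of vLLM's or TGI's actual implementation:

```python
# Toy simulation of continuous vs. static batching. Each list entry is the
# number of tokens a pending request still has to generate; each decode step
# produces one token for every in-flight request.
from collections import deque

def simulate_continuous(queue_lengths, max_batch_size):
    """Continuous batching: refill open slots the moment a request finishes."""
    pending = deque(queue_lengths)
    in_flight = []
    steps = 0
    while pending or in_flight:
        # refill open slots immediately (the "continuous" part)
        while pending and len(in_flight) < max_batch_size:
            in_flight.append(pending.popleft())
        # one decode step; requests with one token left finish and exit
        in_flight = [n - 1 for n in in_flight if n > 1]
        steps += 1
    return steps

def simulate_static(queue_lengths, max_batch_size):
    """Static batching: the whole batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(queue_lengths), max_batch_size):
        steps += max(queue_lengths[i:i + max_batch_size])  # batch waits for its longest request
    return steps
```

With mixed request lengths the gap is stark: for `[5, 1, 5, 1]` and a batch size of 2, static batching needs 10 steps while continuous batching needs 6, because short requests no longer block the slot behind a long one.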

PagedAttention (vLLM's Secret Weapon)

The KV-cache (key-value cache) is the memory structure that stores the model's "attention state" while generating tokens. In naive implementations, this cache is allocated as a large contiguous block per request — wasteful, since different requests generate different numbers of tokens.

vLLM's PagedAttention manages the KV-cache like virtual memory in an OS: in small, non-contiguous pages. This eliminates memory fragmentation and allows vLLM to serve 2-4x more concurrent requests on the same GPU compared to naive implementations.
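The bookkeeping behind this can be sketched in a few lines. This is a toy model of the idea only; block size, pool size, and all names here are illustrative, not vLLM internals:

```python
# Toy sketch of PagedAttention-style bookkeeping: each request's KV-cache
# grows in fixed-size blocks drawn from a shared free list, so no request
# reserves a large contiguous region up front.
BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared pool of block ids
        self.block_tables = {}   # request id -> its (non-contiguous) block ids
        self.token_counts = {}   # request id -> tokens cached so far

    def append_token(self, req):
        """Account for one newly generated token of request `req`."""
        count = self.token_counts.get(req, 0)
        if count % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV-cache exhausted; preempt a request")
            self.block_tables.setdefault(req, []).append(self.free_blocks.pop())
        self.token_counts[req] = count + 1

    def free(self, req):
        """A finished request returns every block to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(req, []))
        self.token_counts.pop(req, None)
```

Because blocks are returned to the pool the instant a request finishes, short and long requests can interleave without fragmenting memory, which is where the extra concurrency headroom comes from.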

vLLM Setup

vLLM is among the fastest open-source inference engines for NVIDIA GPUs. Installation is straightforward:

```bash
pip install vllm

# Start a server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --port 8000
```

This launches an OpenAI-compatible API endpoint. Existing OpenAI client code works without modification: just point the client's base_url (api_base in the legacy 0.x SDK, where you would call openai.ChatCompletion.create()) at http://localhost:8000/v1.
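A minimal stdlib client for the server started above looks like this; the official openai SDK works identically once its base_url points at the server. The helper names (`build_chat_request`, `chat`) are ours, not part of vLLM:

```python
# Query the local vLLM server through its OpenAI-compatible endpoint.
# Assumes the server from the command above is listening on port 8000.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model, prompt, max_tokens=64):
    """Assemble an OpenAI-style chat completion payload."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def chat(prompt):
    """Send one chat request and return the generated text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=build_chat_request("meta-llama/Meta-Llama-3-8B", prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```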

For serving a LoRA-adapted model (from Module 4):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules karachi-bot=./karachi-llm-v1-adapter \
    --port 8000
```
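Once the server is up, the adapter is served under the name given to `--lora-modules`, so requests simply set `model="karachi-bot"` while the base model stays loaded once. You can confirm it registered by listing `/v1/models`; the stdlib helpers below are ours, not part of vLLM:

```python
# Check which model names the vLLM server is exposing (base model + adapters).
# Assumes the LoRA-enabled server from the command above is on port 8000.
import json
import urllib.request

def parse_model_ids(body):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in json.loads(body)["data"]]

def served_model_names(base_url="http://localhost:8000/v1"):
    """Fetch the live model list; expect both the base model and 'karachi-bot'."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return parse_model_ids(resp.read())
```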

Text Generation Inference (TGI)

TGI from Hugging Face is the main alternative: slightly more conservative in memory management, but with excellent Docker support and native Hugging Face Hub integration. For teams already running Docker/Kubernetes in production:

```bash
docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v $PWD/models:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3-8B \
    --max-input-length 2048 \
    --max-total-tokens 4096
```

Note the HF_TOKEN environment variable: Meta-Llama-3-8B is a gated model, so the container needs a Hugging Face token with access to download it.
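TGI also speaks its own native API alongside OpenAI compatibility: you POST a raw prompt to `/generate`. A stdlib sketch against the container mapped to port 8080 above; the helper names are ours:

```python
# Query TGI's native /generate endpoint (distinct from the OpenAI schema).
# Assumes the TGI container from the command above is mapped to port 8080.
import json
import urllib.request

def build_tgi_payload(prompt, max_new_tokens=64):
    """Assemble a TGI /generate request body."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode()

def tgi_generate(prompt, url="http://localhost:8080/generate"):
    """Send one prompt and return the generated text."""
    req = urllib.request.Request(
        url,
        data=build_tgi_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```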

vLLM vs TGI — Which to Choose

| Factor | vLLM | TGI |
| --- | --- | --- |
| Raw throughput | Higher (PagedAttention) | Slightly lower |
| Docker support | Good | Excellent |
| LoRA serving | Native multi-adapter | Requires workaround |
| Windows support | Limited (WSL2 needed) | Better via Docker |
| Community activity | Very active | Active |

For most Pakistani developers deploying on a Linux VPS (Hetzner, DigitalOcean, or local data centers), vLLM is the recommended choice. For Windows-based development environments, run vLLM inside WSL2 or use Docker with TGI.

Benchmarking Your Deployment

Before going live, benchmark your serving setup. The key metrics are:

  • Throughput: tokens per second (TPS) across all concurrent users
  • Time to First Token (TTFT): how quickly the first token arrives (perceived responsiveness)
  • P99 latency: the worst-case response time at the 99th percentile
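These metrics are easy to compute from raw measurements. The helpers below are illustrative utilities of our own, not part of any benchmark tool:

```python
# Turn raw load-test measurements into the metrics above: a nearest-rank
# percentile for P99 latency and an aggregate tokens-per-second figure.
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def throughput_tps(total_tokens, wall_seconds):
    """Aggregate tokens per second across all concurrent requests."""
    return total_tokens / wall_seconds
```

For example, `percentile(latencies, 99)` over your per-request totals gives P99 latency, and TTFT is measured the same way over time-to-first-token samples collected from streaming responses.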

Use the benchmark script that ships in the vLLM source repository (it is not installed by pip, and flag names vary between versions; check `--help` for yours):

```bash
# Clone the repo to get the serving benchmark script
git clone https://github.com/vllm-project/vllm
python vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --base-url http://localhost:8000 \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset-name random \
    --num-prompts 100 \
    --max-concurrency 10
```

A healthy setup on an RTX 3090 (24 GB VRAM) should achieve 800-1,500 TPS for a 7B model with 10 concurrent users.

Practice Lab

  1. Install vLLM in WSL2 (Windows) or directly on Linux. Run the smallest available model (try Qwen/Qwen2-0.5B to avoid VRAM constraints) as a server and query it with a simple Python requests call to the OpenAI-compatible endpoint.

  2. Compare latency: Run 10 sequential requests with naive model.generate() and 10 concurrent requests via the vLLM server. Measure total time for each approach and calculate the speedup.

  3. Test the LoRA serving: If you completed the Module 4 training exercise, serve your trained adapter via vLLM's --enable-lora flag and verify the model responds with the fine-tuned behavior.

Key Takeaways

  • Naive inference is unsuitable for production — continuous batching and PagedAttention are essential at any real traffic level
  • vLLM delivers 2-4x higher throughput than naive inference through PagedAttention KV-cache management
  • vLLM's OpenAI-compatible API means zero client-side code changes — just swap the base URL
  • Benchmark TTFT and P99 latency before launch — 800-1,500 TPS is achievable on RTX 3090 for 7B models

Lesson Summary

Includes hands-on practice lab · 4 runnable code examples · 4-question knowledge check below

Quiz: vLLM & TGI — Production Serving Frameworks

4 questions to test your understanding. Score 60% or higher to pass.