5.1 — vLLM & TGI — Production Serving Frameworks
Running a model locally on your laptop with Ollama is great for development. But when your Pakistani startup needs to serve 500 WhatsApp messages per second, or your agency has five enterprise clients hammering the same endpoint simultaneously, you need a production serving framework. vLLM and Text Generation Inference (TGI) are the two dominant open-source solutions — and understanding the difference will determine whether your LLM service stays online or crashes under load.
The Core Problem: Naive Inference Is Slow
When you naively call model.generate() in PyTorch, each request is processed one at a time. The GPU sits idle between requests. Memory is not efficiently shared between concurrent users. At even modest traffic levels — 10 concurrent users — latency spikes to 30-60 seconds per request. That's unusable.
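To make the baseline concrete, here is a minimal sketch of naive sequential serving with Hugging Face `transformers` (the model id and prompts are illustrative; any causal LM behaves the same way):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model so the sketch fits on modest hardware
model_id = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

prompts = ["Explain continuous batching."] * 10  # 10 "concurrent" users

start = time.time()
for prompt in prompts:  # each request waits for the previous one to finish
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=128)
print(f"Sequential total: {time.time() - start:.1f}s")
```

Every request holds the GPU hostage for its full generation time, so the tenth user waits roughly ten times as long as the first.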
Production serving frameworks solve this through two key techniques: continuous batching and PagedAttention.
Continuous Batching
Traditional batching waits for a batch of requests to arrive before processing them together. Continuous batching processes requests as they arrive, dynamically adding new requests to the in-flight batch as soon as a slot opens. This keeps GPU utilization above 90% even with variable request rates — vs. the 40-60% you get with naive inference.
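The scheduling idea is easier to see in code. Below is a toy simulation of a continuous-batching loop; it is a conceptual sketch, not vLLM's actual scheduler, and `MAX_BATCH` and the token counts are made up:

```python
from collections import deque
from dataclasses import dataclass
import random

MAX_BATCH = 4  # in-flight batch slots available on the GPU

@dataclass
class Request:
    id: int
    tokens_left: int  # tokens this request still has to generate

def decode_step(batch):
    # One forward pass advances every in-flight request by one token
    for req in batch:
        req.tokens_left -= 1

waiting = deque(Request(i, random.randint(3, 8)) for i in range(10))
in_flight, step = [], 0

while waiting or in_flight:
    # Admit new requests the moment a slot frees up: the "continuous" part.
    # Static batching would instead wait for the whole batch to finish.
    while waiting and len(in_flight) < MAX_BATCH:
        in_flight.append(waiting.popleft())
    decode_step(in_flight)
    step += 1
    for req in [r for r in in_flight if r.tokens_left == 0]:
        print(f"step {step}: request {req.id} done, slot freed")
    in_flight = [r for r in in_flight if r.tokens_left > 0]
```

Because finished requests leave immediately, short generations never block long ones, which is what keeps utilization high under variable request rates.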
PagedAttention (vLLM's Secret Weapon)
The KV-cache (key-value cache) is the memory structure that stores the model's "attention state" while generating tokens. In naive implementations, this cache is allocated as a large contiguous block per request — wasteful, since different requests generate different numbers of tokens.
vLLM's PagedAttention manages the KV-cache like virtual memory in an OS: in small, non-contiguous pages. This eliminates memory fragmentation and allows vLLM to serve 2-4x more concurrent requests on the same GPU compared to naive implementations.
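To build intuition, here is a toy page allocator in the spirit of PagedAttention. This is a conceptual sketch, not vLLM's implementation, though the 16-token block size matches vLLM's default:

```python
BLOCK_SIZE = 16  # tokens per page (vLLM's default block size)

class PagedKVAllocator:
    """Toy allocator: hands out fixed-size, non-contiguous pages on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of page ids

    def append_token(self, req_id, token_index):
        table = self.block_tables.setdefault(req_id, [])
        # A new page is needed only when the current one fills up, so at
        # most BLOCK_SIZE - 1 cache slots are ever wasted per request
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def release(self, req_id):
        # Finished requests return their pages to the shared pool
        self.free_blocks.extend(self.block_tables.pop(req_id, []))

alloc = PagedKVAllocator(num_blocks=64)
for t in range(40):          # a request that generates 40 tokens...
    alloc.append_token("req-1", t)
print(len(alloc.block_tables["req-1"]))  # ...occupies only 3 pages
```

Contrast this with reserving one contiguous max-length block per request: the paged approach only consumes memory as tokens are actually generated.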
vLLM Setup
vLLM is among the fastest open-source inference engines for NVIDIA GPUs. Installation is straightforward:
```bash
pip install vllm

# Start a server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --port 8000
```
This launches an OpenAI-compatible API endpoint. Your existing OpenAI client code works without modification: just point the base_url at http://localhost:8000/v1.
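For example, with the `openai` Python SDK (v1+), assuming the server above is running; vLLM accepts any placeholder API key unless you configure one:

```python
from openai import OpenAI

# Same client library you would use for OpenAI: only the base_url changes
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",  # must match the --model you served
    prompt="Karachi is famous for",
    max_tokens=50,
)
print(resp.choices[0].text)
```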
For serving a LoRA-adapted model (from Module 4):
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules karachi-bot=./karachi-llm-v1-adapter \
    --port 8000
```
Text Generation Inference (TGI)
TGI from Hugging Face is the main alternative: slightly more conservative in memory management, but with excellent Docker support and native Hugging Face Hub integration. For teams already running Docker/Kubernetes in production:
```bash
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/models:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Meta-Llama-3-8B \
    --max-input-length 2048 \
    --max-total-tokens 4096
```
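Once the container is up, you can hit TGI's native `/generate` endpoint (mapped to host port 8080 above); a minimal sketch with the `requests` library, prompt illustrative:

```python
import requests

# TGI's native REST endpoint; the container maps port 80 to 8080 on the host
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Karachi is famous for", "parameters": {"max_new_tokens": 50}},
)
print(resp.json()["generated_text"])
```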
vLLM vs TGI — Which to Choose
| Factor | vLLM | TGI |
|---|---|---|
| Raw throughput | Higher (PagedAttention) | Slightly lower |
| Docker support | Good | Excellent |
| LoRA serving | Native multi-adapter | Requires workaround |
| Windows support | Limited (WSL2 needed) | Better via Docker |
| Community activity | Very active | Active |
For most Pakistani developers deploying on a Linux VPS (Hetzner, DigitalOcean, or local data centers), vLLM is the recommended choice. For Windows-based development environments, run vLLM inside WSL2 or use Docker with TGI.
Benchmarking Your Deployment
Before going live, benchmark your serving setup. The key metrics are:
- Throughput: tokens per second (TPS) across all concurrent users
- Time to First Token (TTFT): how quickly the first token arrives (perceived responsiveness)
- P99 latency: the worst-case response time at the 99th percentile
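Before reaching for a full benchmark harness, you can eyeball TTFT yourself with a single streaming request against the vLLM server from earlier (a minimal sketch):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="Karachi is famous for",
    max_tokens=100,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.time() - start  # delay before the first streamed token
print(f"TTFT: {ttft:.2f}s, total: {time.time() - start:.2f}s")
```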
For load testing, use vLLM's benchmarking script, `benchmarks/benchmark_serving.py`, which ships in the vLLM GitHub repository rather than the pip package. Exact flag names vary between releases, so check `--help` for your version:

```bash
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset-name random \
    --num-prompts 100 \
    --request-rate 10
```
A healthy setup on an RTX 3090 (24 GB VRAM) should achieve 800-1,500 TPS for a 7B model with 10 concurrent users.
Practice Lab
- Install vLLM in WSL2 (Windows) or directly on Linux. Run the smallest available model (try `Qwen/Qwen2-0.5B` to avoid VRAM constraints) as a server and query it with a simple Python `requests` call to the OpenAI-compatible endpoint.
- Compare latency: Run 10 sequential requests with naive `model.generate()` and 10 concurrent requests via the vLLM server. Measure total time for each approach and calculate the speedup.
- Test LoRA serving: If you completed the Module 4 training exercise, serve your trained adapter via vLLM's `--enable-lora` flag and verify the model responds with the fine-tuned behavior.
Key Takeaways
- Naive inference is unsuitable for production — continuous batching and PagedAttention are essential at any real traffic level
- vLLM delivers 2-4x higher throughput than naive inference through PagedAttention KV-cache management
- vLLM's OpenAI-compatible API means zero client-side code changes — just swap the base URL
- Benchmark TTFT and P99 latency before launch — 800-1,500 TPS is achievable on RTX 3090 for 7B models