5.1 — vLLM & TGI — Production Serving Frameworks
Running a model locally on your laptop with Ollama is great for development. But when your Pakistani startup needs to serve 500 WhatsApp messages per second, or your agency has five enterprise clients hammering the same endpoint simultaneously, you need a production serving framework. vLLM and Text Generation Inference (TGI) are the two dominant open-source solutions — and understanding the difference will determine whether your LLM service stays online or crashes under load.
The Core Problem: Naive Inference Is Slow
When you naively call model.generate() in PyTorch, each request is processed one at a time. The GPU sits idle between requests. Memory is not efficiently shared between concurrent users. At even modest traffic levels — 10 concurrent users — latency spikes to 30-60 seconds per request. That's unusable.
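To make the baseline concrete, here is a minimal sketch of naive sequential serving with Hugging Face `transformers` (the model id and prompts are illustrative; any causal LM behaves the same way):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model so the sketch fits on modest hardware
model_id = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

prompts = ["Explain continuous batching."] * 10  # 10 "concurrent" users

start = time.time()
for prompt in prompts:  # each request waits for the previous one to finish
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=128)
print(f"Sequential total: {time.time() - start:.1f}s")
```

Every request holds the GPU hostage for its full generation time, so the tenth user waits roughly ten times as long as the first.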
Production serving frameworks solve this through two key techniques: continuous batching and PagedAttention.
Continuous Batching
Traditional batching waits for a batch of requests to arrive before processing them together. Continuous batching processes requests as they arrive, dynamically adding new requests to the in-flight batch as soon as a slot opens. This keeps GPU utilization above 90% even with variable request rates — vs. the 40-60% you get with naive inference.
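The scheduling idea is easier to see in code. Below is a toy simulation of a continuous-batching loop; it is a conceptual sketch, not vLLM's actual scheduler, and `MAX_BATCH` and the token counts are made up:

```python
from collections import deque
from dataclasses import dataclass
import random

MAX_BATCH = 4  # in-flight batch slots available on the GPU

@dataclass
class Request:
    id: int
    tokens_left: int  # tokens this request still has to generate

def decode_step(batch):
    # One forward pass advances every in-flight request by one token
    for req in batch:
        req.tokens_left -= 1

waiting = deque(Request(i, random.randint(3, 8)) for i in range(10))
in_flight, step = [], 0

while waiting or in_flight:
    # Admit new requests the moment a slot frees up: the "continuous" part.
    # Static batching would instead wait for the whole batch to finish.
    while waiting and len(in_flight) < MAX_BATCH:
        in_flight.append(waiting.popleft())
    decode_step(in_flight)
    step += 1
    for req in [r for r in in_flight if r.tokens_left == 0]:
        print(f"step {step}: request {req.id} done, slot freed")
    in_flight = [r for r in in_flight if r.tokens_left > 0]
```

Because finished requests leave immediately, short generations never block long ones, which is what keeps utilization high under variable request rates.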
PagedAttention (vLLM's Secret Weapon)
The KV-cache (key-value cache) is the memory structure that stores the model's "attention state" while generating tokens. In naive implementations, this cache is allocated as a large contiguous block per request — wasteful, since different requests generate different numbers of tokens.
vLLM's PagedAttention manages the KV-cache like virtual memory in an OS: in small, non-contiguous pages. This eliminates memory fragmentation and allows vLLM to serve 2-4x more concurrent requests on the same GPU compared to naive implementations.
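To build intuition, here is a toy page allocator in the spirit of PagedAttention. This is a conceptual sketch, not vLLM's implementation, though the 16-token block size matches vLLM's default:

```python
BLOCK_SIZE = 16  # tokens per page (vLLM's default block size)

class PagedKVAllocator:
    """Toy allocator: hands out fixed-size, non-contiguous pages on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of page ids

    def append_token(self, req_id, token_index):
        table = self.block_tables.setdefault(req_id, [])
        # A new page is needed only when the current one fills up, so at
        # most BLOCK_SIZE - 1 cache slots are ever wasted per request
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def release(self, req_id):
        # Finished requests return their pages to the shared pool
        self.free_blocks.extend(self.block_tables.pop(req_id, []))

alloc = PagedKVAllocator(num_blocks=64)
for t in range(40):          # a request that generates 40 tokens...
    alloc.append_token("req-1", t)
print(len(alloc.block_tables["req-1"]))  # ...occupies only 3 pages
```

Contrast this with reserving one contiguous max-length block per request: the paged approach only consumes memory as tokens are actually generated.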
vLLM Setup
vLLM is among the fastest open-source inference engines for NVIDIA GPUs. Installation is straightforward:
```bash
pip install vllm

# Start a server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --port 8000
```
This launches an OpenAI-compatible API endpoint. Your existing OpenAI client code works without modification: just point the base_url at http://localhost:8000/v1.
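For example, with the `openai` Python SDK (v1+), assuming the server above is running; vLLM accepts any placeholder API key unless you configure one:

```python
from openai import OpenAI

# Same client library you would use for OpenAI: only the base_url changes
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",  # must match the --model you served
    prompt="Karachi is famous for",
    max_tokens=50,
)
print(resp.choices[0].text)
```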
For serving a LoRA-adapted model (from Module 4):
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules karachi-bot=./karachi-llm-v1-adapter \
    --port 8000
```
Text Generation Inference (TGI)
TGI from Hugging Face is the main alternative: slightly more conservative in memory management, but with excellent Docker support and native Hugging Face Hub integration. For teams already running Docker/Kubernetes in production:
```bash
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/models:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Meta-Llama-3-8B \
    --max-input-length 2048 \
    --max-total-tokens 4096
```
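Once the container is up, you can hit TGI's native `/generate` endpoint (mapped to host port 8080 above); a minimal sketch with the `requests` library, prompt illustrative:

```python
import requests

# TGI's native REST endpoint; the container maps port 80 to 8080 on the host
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Karachi is famous for", "parameters": {"max_new_tokens": 50}},
)
print(resp.json()["generated_text"])
```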
vLLM vs TGI — Which to Choose
| Factor | vLLM | TGI |
|---|---|---|
| Raw throughput | Higher (PagedAttention) | Slightly lower |
| Docker support | Good | Excellent |
| LoRA serving | Native multi-adapter | Requires workaround |
| Windows support | Limited (WSL2 needed) | Better via Docker |
| Community activity | Very active | Active |
For most Pakistani developers deploying on a Linux VPS (Hetzner, DigitalOcean, or local data centers), vLLM is the recommended choice. For Windows-based development environments, run vLLM inside WSL2 or use Docker with TGI.
Benchmarking Your Deployment
Before going live, benchmark your serving setup. The key metrics are:
- Throughput: tokens per second (TPS) across all concurrent users
- Time to First Token (TTFT): how quickly the first token arrives (perceived responsiveness)
- P99 latency: the worst-case response time at the 99th percentile
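Before reaching for a full benchmark harness, you can eyeball TTFT yourself with a single streaming request against the vLLM server from earlier (a minimal sketch):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="Karachi is famous for",
    max_tokens=100,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.time() - start  # delay before the first streamed token
print(f"TTFT: {ttft:.2f}s, total: {time.time() - start:.2f}s")
```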
For load testing, use vLLM's benchmarking script, `benchmarks/benchmark_serving.py`, which ships in the vLLM GitHub repository rather than the pip package. Exact flag names vary between releases, so check `--help` for your version:

```bash
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset-name random \
    --num-prompts 100 \
    --request-rate 10
```
A healthy setup on an RTX 3090 (24 GB VRAM) should achieve 800-1,500 TPS for a 7B model with 10 concurrent users.
Practice Lab
- Install vLLM in WSL2 (Windows) or directly on Linux. Run the smallest available model (try `Qwen/Qwen2-0.5B` to avoid VRAM constraints) as a server and query it with a simple Python `requests` call to the OpenAI-compatible endpoint.
- Compare latency: Run 10 sequential requests with naive `model.generate()` and 10 concurrent requests via the vLLM server. Measure total time for each approach and calculate the speedup.
- Test LoRA serving: If you completed the Module 4 training exercise, serve your trained adapter via vLLM's `--enable-lora` flag and verify the model responds with the fine-tuned behavior.
Key Takeaways
- Naive inference is unsuitable for production — continuous batching and PagedAttention are essential at any real traffic level
- vLLM delivers 2-4x higher throughput than naive inference through PagedAttention KV-cache management
- vLLM's OpenAI-compatible API means zero client-side code changes — just swap the base URL
- Benchmark TTFT and P99 latency before launch — 800-1,500 TPS is achievable on RTX 3090 for 7B models