6.1 — Docker for AI — Containerizing Models & APIs
You built a working AI model on your laptop. You send it to your teammate. It doesn't work. "Works on my machine" is the oldest problem in software — and Docker is the solution. Docker packages your AI model, its dependencies, and the entire runtime environment into a single portable container that runs identically everywhere. This lesson teaches you to containerize AI models and APIs for production deployment.
Why Docker for AI?
The Dependency Nightmare
A typical AI project requires:
- Python 3.11 (not 3.12, not 3.10 — exactly 3.11)
- PyTorch 2.2.1 with CUDA 12.1 support
- transformers, tokenizers, safetensors, accelerate (all pinned versions)
- System libraries: libcudnn, libcublas, libnccl
- Model weights: 4-14GB binary files
Without Docker, every deployment is a dice roll. With Docker, you ship the exact working environment.
┌─────────────────────────────────────────┐
│ WITHOUT DOCKER │
│ Dev laptop → "works" ✓ │
│ Server → wrong Python → ✗ │
│ Client machine → missing CUDA → ✗ │
│ Teammate → different PyTorch → ✗ │
├─────────────────────────────────────────┤
│ WITH DOCKER │
│ Dev laptop → container → ✓ │
│ Server → same container → ✓ │
│ Client machine → same container → ✓ │
│ Teammate → same container → ✓ │
└─────────────────────────────────────────┘
Docker Fundamentals for AI Engineers
Key Concepts
| Concept | What It Is | AI Example |
|---|---|---|
| Image | Blueprint/template | "Python 3.11 + PyTorch + my model" |
| Container | Running instance of an image | Your API processing requests right now |
| Dockerfile | Build instructions | Recipe to create the image |
| Volume | Persistent storage | Model weights (don't bake 14GB into the image) |
| Port mapping | Network access | Expose port 8000 for your FastAPI |
| Registry | Image storage | Docker Hub, GitHub Container Registry |
Installing Docker
Windows (with WSL2):
# Install Docker Desktop from docker.com
# Enable WSL2 backend in settings
# For GPU support: install NVIDIA Container Toolkit
Ubuntu/Debian VPS:
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER   # log out and back in for the group change to take effect
# For GPU: install nvidia-container-toolkit (requires NVIDIA's apt repository;
# see the "GPU Access in Docker" section below for the full setup)
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Your First AI Dockerfile
Example: Containerizing a Llama 3 API
# Base image with CUDA support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# Install Python and curl (curl is needed by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Install Python dependencies first (cached layer)
COPY requirements.txt .
# Call pip through python3.11 so packages install for the right interpreter
RUN python3.11 -m pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose the API port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=10s \
CMD curl -f http://localhost:8000/health || exit 1
# Start the API server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
The requirements.txt
fastapi==0.109.0
uvicorn[standard]==0.27.0
transformers==4.37.0
torch==2.2.1
accelerate==0.27.0
safetensors==0.4.2
Building and Running
# Build the image
docker build -t my-llm-api:v1 .
# Run WITHOUT GPU
docker run -p 8000:8000 my-llm-api:v1
# Run WITH GPU (NVIDIA)
docker run --gpus all -p 8000:8000 my-llm-api:v1
# Run with model weights mounted from host
docker run --gpus all \
-p 8000:8000 \
-v /home/models/llama3:/app/models \
my-llm-api:v1
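Once the container is up, probe the health endpoint before pointing traffic at it. A small stdlib sketch (the URL and /health route are assumptions matching the Dockerfile above):

```python
# smoke_test.py - check that the containerized API answers on its health route.
import json
import urllib.request

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 with a JSON body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            json.loads(resp.read().decode())  # must be valid JSON
            return True
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed JSON
        return False

# Example: check_health("http://localhost:8000/health")
```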
Dockerfile Best Practices for AI
Multi-Stage Builds (Smaller Images)
AI images can be 10-20GB. Multi-stage builds cut that down:
# Stage 1: build dependencies in a throwaway image
FROM python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
# Stage 2: runtime (smaller base). No build tools, but it still needs
# a Python interpreter matching the builder's version (3.11)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3.11 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
# Ubuntu's system Python does not search /usr/local/.../site-packages by default
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
COPY . /app
WORKDIR /app
CMD ["python3.11", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Layer Caching Strategy
Docker caches each layer. Put things that change rarely at the top:
# SLOW TO CHANGE (cached)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 python3-pip
# MEDIUM (cached unless deps change)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# FAST TO CHANGE (rebuilt often)
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Don't Bake Model Weights into Images
Model files are 4-14GB. Baking them into the image means:
- 14GB download every time you update code
- 14GB stored in Docker registry
- Slow builds, slow deploys
Instead, use volumes:
# Mount model directory at runtime
docker run --gpus all \
-v /data/models:/app/models \
-p 8000:8000 \
my-llm-api:v1
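Inside the container, the application resolves the mounted directory at startup, usually from an environment variable so the same image works with any host path. A stdlib sketch (the MODEL_PATH variable and /app/models default are conventions assumed here, matching the compose example later in the lesson):

```python
# config.py - resolve the model directory mounted via -v at runtime.
# MODEL_PATH and the /app/models fallback are assumed conventions.
import os
from pathlib import Path

def resolve_model_dir(env=None) -> Path:
    """Read MODEL_PATH from the environment, defaulting to /app/models."""
    env = os.environ if env is None else env
    model_dir = Path(env.get("MODEL_PATH", "/app/models"))
    if not model_dir.is_dir():
        # Fail fast with a clear message instead of a cryptic load error later
        raise FileNotFoundError(
            f"Model directory {model_dir} not found. Did you mount the volume?"
        )
    return model_dir
```

Failing fast here turns a confusing mid-startup crash into an obvious "you forgot the -v flag" message.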
.dockerignore
__pycache__/
*.pyc
.git/
.env
models/
*.bin
*.safetensors
node_modules/
Docker Compose for AI Stacks
Real AI deployments have multiple services. Docker Compose orchestrates them:
# docker-compose.yml
version: '3.8'

services:
  llm-api:
    build: ./api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL_PATH=/app/models/llama3-8b-q4
      - MAX_TOKENS=2048

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - llm-api
Run the entire stack:
docker compose up -d # Start all services
docker compose logs -f # Watch logs
docker compose down # Stop everything
GPU Access in Docker
NVIDIA Container Toolkit
# Install on Ubuntu: add NVIDIA's apt repository, then the toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU access
docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
GPU Flags
--gpus all # All GPUs
--gpus '"device=0"' # Specific GPU by index
--gpus '"device=GPU-abc123"' # Specific GPU by UUID
Practice Lab
Task 1: Containerize a Simple AI API
Create a Dockerfile for a FastAPI app that loads a small model (e.g., distilbert-base-uncased for sentiment analysis). Build and run it locally. Test the /predict endpoint with curl.
Task 2: Docker Compose Stack
Create a docker-compose.yml with your AI API + Redis for caching + Nginx as reverse proxy. Run all three services together.
Task 3: Optimization Challenge
Take your Dockerfile from Task 1 and optimize it: use a multi-stage build, proper layer ordering, and a .dockerignore. Compare the image size before and after.
Pakistan Case Study
Meet Fahad — an ML engineer at a Lahore startup building an Urdu NLP API.
His problem: The API worked perfectly on his RTX 3080 laptop. Deploying to their Hetzner VPS was a 2-day nightmare every time — CUDA version mismatches, Python conflicts, missing system libraries.
His Docker solution:
- Created a single Dockerfile with pinned CUDA + Python + PyTorch versions
- Model weights stored on a persistent volume (not in the image)
- docker-compose.yml with API + Redis cache + Nginx SSL
Deployment now:
git pull && docker compose up -d --build
# Done. 3 minutes.
Results:
- Deployment time: 2 days → 3 minutes
- "Broken on server" incidents: 2-3/month → zero
- New team member onboarding: 1 day of setup → docker compose up and they're running
- His boss gave him a PKR 20,000 raise for "making deploys boring" (the best kind of boring)
Key Takeaways
- Docker eliminates "works on my machine" — your AI runs identically everywhere
- Use NVIDIA base images + Container Toolkit for GPU access in containers
- Never bake model weights into images — mount them as volumes at runtime
- Multi-stage builds cut image sizes by 50-70%
- Docker Compose orchestrates multi-service AI stacks (API + cache + proxy)
- Layer ordering matters: put rarely-changing layers first for faster builds
Next lesson: Kubernetes basics for scaling AI deployments across multiple machines.