Module 6: AI Infrastructure & Local LLMs

6.1 Docker for AI — Containerizing Models & APIs

30 min · Practice Lab · Quiz (4Q)

You build a working AI model on your laptop. You send it to your teammate. It doesn't work. "Works on my machine" is the oldest problem in software — and Docker is the solution. Docker packages your AI model, its dependencies, and the entire runtime environment into a single portable container that runs identically everywhere. This lesson teaches you to containerize AI models and APIs for production deployment.

Why Docker for AI?

The Dependency Nightmare

A typical AI project requires:

  • Python 3.11 (not 3.12, not 3.10 — exactly 3.11)
  • PyTorch 2.2.1 with CUDA 12.1 support
  • transformers, tokenizers, safetensors, accelerate (all pinned versions)
  • System libraries: libcudnn, libcublas, libnccl
  • Model weights: 4-14GB binary files

Without Docker, every deployment is a dice roll. With Docker, you ship the exact working environment.

code
┌─────────────────────────────────────────┐
│  WITHOUT DOCKER                         │
│  Dev laptop → "works" ✓                 │
│  Server → wrong Python → ✗             │
│  Client machine → missing CUDA → ✗     │
│  Teammate → different PyTorch → ✗      │
├─────────────────────────────────────────┤
│  WITH DOCKER                            │
│  Dev laptop → container → ✓            │
│  Server → same container → ✓           │
│  Client machine → same container → ✓   │
│  Teammate → same container → ✓         │
└─────────────────────────────────────────┘

Docker Fundamentals for AI Engineers

Key Concepts

Concept        What It Is                      AI Example
Image          Blueprint/template              "Python 3.11 + PyTorch + my model"
Container      Running instance of an image    Your API processing requests right now
Dockerfile     Build instructions              Recipe to create the image
Volume         Persistent storage              Model weights (don't bake 14GB into the image)
Port mapping   Network access                  Expose port 8000 for your FastAPI
Registry       Image storage                   Docker Hub, GitHub Container Registry
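
To make the image/container distinction concrete, here are the everyday commands for each (the tag my-llm-api:v1 is the image we build later in this lesson):

bash
docker images                             # list local images (the blueprints)
docker ps                                 # list running containers (live instances)
docker ps -a                              # include stopped containers
docker run -d --name api my-llm-api:v1   # one image can spawn many containers
docker stop api && docker rm api         # containers are disposable; the image stays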

Installing Docker

Windows (with WSL2):

bash
# Install Docker Desktop from docker.com
# Enable WSL2 backend in settings
# For GPU support: install NVIDIA Container Toolkit

Ubuntu/Debian VPS:

bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# For GPU: install nvidia-container-toolkit (requires NVIDIA's apt repo;
# the full repo setup is shown in the "GPU Access in Docker" section below)
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
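
After installing, a quick smoke test confirms the daemon works and your user can reach it without sudo (log out and back in first so the group change takes effect):

bash
docker --version          # client installed?
docker run hello-world    # daemon can pull and run a container?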

Your First AI Dockerfile

Example: Containerizing a Llama 3 API

dockerfile
# Base image with CUDA support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and curl (curl is used by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install Python dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Start the API server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The requirements.txt

code
fastapi==0.109.0
uvicorn[standard]==0.27.0
transformers==4.37.0
torch==2.2.1
accelerate==0.27.0
safetensors==0.4.2

Building and Running

bash
# Build the image
docker build -t my-llm-api:v1 .

# Run WITHOUT GPU
docker run -p 8000:8000 my-llm-api:v1

# Run WITH GPU (NVIDIA)
docker run --gpus all -p 8000:8000 my-llm-api:v1

# Run with model weights mounted from host
docker run --gpus all \
    -p 8000:8000 \
    -v /home/models/llama3:/app/models \
    my-llm-api:v1
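
With the container running, smoke-test it from the host (assuming the /health route sketched earlier; adjust the path to match your app):

bash
curl -f http://localhost:8000/health
# {"status":"ok"}

docker logs -f $(docker ps -q --filter ancestor=my-llm-api:v1)   # tail server logs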

Dockerfile Best Practices for AI

1. Multi-Stage Builds (Smaller Images)

AI images can be 10-20GB. Multi-stage builds cut that down:

dockerfile
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Stage 2: Runtime (smaller base; it still needs a Python interpreter)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
# Ubuntu's Python doesn't scan /usr/local/.../site-packages by default,
# so point it at the copied packages explicitly
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
COPY . /app
WORKDIR /app
# Run uvicorn via the interpreter so we don't rely on copied script shebangs
CMD ["python3.11", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

2. Layer Caching Strategy

Docker caches each layer. Put things that change rarely at the top:

dockerfile
# SLOW TO CHANGE (cached)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 python3-pip

# MEDIUM (cached unless deps change)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# FAST TO CHANGE (rebuilt often)
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

3. Don't Bake Model Weights into Images

Model files are 4-14GB. Baking them into the image means:

  • 14GB download every time you update code
  • 14GB stored in Docker registry
  • Slow builds, slow deploys

Instead, use volumes:

bash
# Mount model directory at runtime
docker run --gpus all \
    -v /data/models:/app/models \
    -p 8000:8000 \
    my-llm-api:v1
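
One way to populate the host-side model directory in the first place (a sketch, assuming the Hugging Face CLI is installed and your account has access to the model). Mounting the directory read-only is a good habit so the container can't corrupt the weights:

bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /data/models/llama3

# Then mount read-only (:ro)
docker run --gpus all -p 8000:8000 \
    -v /data/models/llama3:/app/models:ro \
    my-llm-api:v1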

4. .dockerignore

Exclude anything the image doesn't need; it keeps the build context (and your layers) small:

code
__pycache__/
*.pyc
.git/
.env
models/
*.bin
*.safetensors
node_modules/
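
To confirm the ignore file is working, check the build-context size that Docker reports; it should be kilobytes, not the gigabytes sitting in models/ (exact output format varies by Docker version):

bash
docker build -t my-llm-api:v1 . 2>&1 | grep -i "transferring context"
# e.g. "=> => transferring context: 41.2kB"   <- weights and .git were excluded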

Docker Compose for AI Stacks

Real AI deployments have multiple services. Docker Compose orchestrates them:

yaml
# docker-compose.yml (the top-level `version` key is obsolete in Compose v2)

services:
  llm-api:
    build: ./api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL_PATH=/app/models/llama3-8b-q4
      - MAX_TOKENS=2048

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - llm-api
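
The nginx service above mounts a ./nginx.conf from the host. A minimal reverse-proxy config that forwards everything to the API might look like this (a sketch; service names like llm-api resolve via Compose's internal DNS):

bash
cat > nginx.conf <<'EOF'
events {}
http {
  server {
    listen 80;
    location / {
      proxy_pass http://llm-api:8000;   # Compose service name, internal DNS
    }
  }
}
EOF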

Run the entire stack:

bash
docker compose up -d        # Start all services
docker compose logs -f      # Watch logs
docker compose down         # Stop everything
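
A few more day-to-day Compose commands worth knowing:

bash
docker compose ps                  # service status and health
docker compose restart llm-api     # bounce a single service
docker compose up -d --build       # rebuild images and redeploy after code changes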

GPU Access in Docker

NVIDIA Container Toolkit

bash
# Install on Ubuntu: add NVIDIA's apt repo and signing key, then install
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access
docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

GPU Flags

bash
--gpus all          # All GPUs
--gpus '"device=0"' # Specific GPU by index
--gpus '"device=GPU-abc123"'  # Specific GPU by UUID

Practice Lab

Task 1: Containerize a Simple AI API
Create a Dockerfile for a FastAPI app that loads a small model (e.g., distilbert-base-uncased for sentiment analysis). Build and run it locally. Test the /predict endpoint with curl.

Task 2: Docker Compose Stack
Create a docker-compose.yml with your AI API + Redis for caching + Nginx as reverse proxy. Run all three services together.

Task 3: Optimization Challenge
Take your Dockerfile from Task 1 and optimize it: use multi-stage build, proper layer ordering, and .dockerignore. Compare the image size before and after.

Pakistan Case Study

Meet Fahad — an ML engineer at a Lahore startup building an Urdu NLP API.

His problem: The API worked perfectly on his RTX 3080 laptop. Deploying to their Hetzner VPS was a 2-day nightmare every time — CUDA version mismatches, Python conflicts, missing system libraries.

His Docker solution:

  • Created a single Dockerfile with pinned CUDA + Python + PyTorch versions
  • Model weights stored on a persistent volume (not in the image)
  • docker-compose.yml with API + Redis cache + Nginx SSL

Deployment now:

bash
git pull && docker compose up -d --build
# Done. 3 minutes.

Results:

  • Deployment time: 2 days → 3 minutes
  • "Broken on server" incidents: 2-3/month → zero
  • New team member onboarding: 1 day of setup → docker compose up and they're running
  • His boss gave him a PKR 20,000 raise for "making deploys boring" (the best kind of boring)

Key Takeaways

  • Docker eliminates "works on my machine" — your AI runs identically everywhere
  • Use NVIDIA base images + Container Toolkit for GPU access in containers
  • Never bake model weights into images — mount them as volumes at runtime
  • Multi-stage builds can cut image sizes dramatically (often 50-70%)
  • Docker Compose orchestrates multi-service AI stacks (API + cache + proxy)
  • Layer ordering matters: put rarely-changing layers first for faster builds

Next lesson: Kubernetes basics for scaling AI deployments across multiple machines.

Lesson Summary

Includes a hands-on practice lab, runnable code examples, and a 4-question knowledge check below.

Quiz: Docker for AI — Containerizing Models & APIs

4 questions to test your understanding. Score 60% or higher to pass.