Module 6: AI Infrastructure & Local LLMs

6.1 Docker for AI — Containerizing Models & APIs

30 min · Practice Lab · Quiz (4Q)

You build a working AI model on your laptop. You send it to your teammate. It doesn't work. "Works on my machine" is the oldest problem in software — and Docker is the solution. Docker packages your AI model, its dependencies, and the entire runtime environment into a single portable container that runs identically everywhere. This lesson teaches you to containerize AI models and APIs for production deployment.

Why Docker for AI?

The Dependency Nightmare

A typical AI project requires:

  • Python 3.11 (not 3.12, not 3.10 — exactly 3.11)
  • PyTorch 2.2.1 with CUDA 12.1 support
  • transformers, tokenizers, safetensors, accelerate (all pinned versions)
  • System libraries: libcudnn, libcublas, libnccl
  • Model weights: 4-14GB binary files

Without Docker, every deployment is a dice roll. With Docker, you ship the exact working environment.

code
┌─────────────────────────────────────────┐
│  WITHOUT DOCKER                         │
│  Dev laptop → "works" ✓                 │
│  Server → wrong Python → ✗             │
│  Client machine → missing CUDA → ✗     │
│  Teammate → different PyTorch → ✗      │
├─────────────────────────────────────────┤
│  WITH DOCKER                            │
│  Dev laptop → container → ✓            │
│  Server → same container → ✓           │
│  Client machine → same container → ✓   │
│  Teammate → same container → ✓         │
└─────────────────────────────────────────┘

Docker Fundamentals for AI Engineers

Key Concepts

Concept        What It Is                      AI Example
Image          Blueprint/template              "Python 3.11 + PyTorch + my model"
Container      Running instance of an image    Your API processing requests right now
Dockerfile     Build instructions              Recipe to create the image
Volume         Persistent storage              Model weights (don't bake 14GB into the image)
Port mapping   Network access                  Expose port 8000 for your FastAPI
Registry       Image storage                   Docker Hub, GitHub Container Registry
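
To make the image/container distinction concrete, here are the everyday commands for each (the tag my-llm-api:v1 is the image we build later in this lesson):

bash
docker images                             # list local images (the blueprints)
docker ps                                 # list running containers (live instances)
docker ps -a                              # include stopped containers
docker run -d --name api my-llm-api:v1   # one image can spawn many containers
docker stop api && docker rm api         # containers are disposable; the image stays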

Installing Docker

Windows (with WSL2):

bash
# Install Docker Desktop from docker.com
# Enable WSL2 backend in settings
# For GPU support: install NVIDIA Container Toolkit

Ubuntu/Debian VPS:

bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# For GPU: install nvidia-container-toolkit (requires NVIDIA's apt repo;
# the full repo setup is shown in the "GPU Access in Docker" section below)
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
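
After installing, a quick smoke test confirms the daemon works and your user can reach it without sudo (log out and back in first so the group change takes effect):

bash
docker --version          # client installed?
docker run hello-world    # daemon can pull and run a container?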

Your First AI Dockerfile

Example: Containerizing a Llama 3 API

dockerfile
# Base image with CUDA support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and curl (curl is used by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    python3.11 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install Python dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the API port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Start the API server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The requirements.txt

code
fastapi==0.109.0
uvicorn[standard]==0.27.0
transformers==4.37.0
torch==2.2.1
accelerate==0.27.0
safetensors==0.4.2

Building and Running

bash
# Build the image
docker build -t my-llm-api:v1 .

# Run WITHOUT GPU
docker run -p 8000:8000 my-llm-api:v1

# Run WITH GPU (NVIDIA)
docker run --gpus all -p 8000:8000 my-llm-api:v1

# Run with model weights mounted from host
docker run --gpus all \
    -p 8000:8000 \
    -v /home/models/llama3:/app/models \
    my-llm-api:v1
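
With the container running, smoke-test it from the host (assuming the /health route sketched earlier; adjust the path to match your app):

bash
curl -f http://localhost:8000/health
# {"status":"ok"}

docker logs -f $(docker ps -q --filter ancestor=my-llm-api:v1)   # tail server logs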

Dockerfile Best Practices for AI

1. Multi-Stage Builds (Smaller Images)

AI images can be 10-20GB. Multi-stage builds cut that down:

dockerfile
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Stage 2: Runtime (smaller base; it still needs a Python interpreter)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
# Ubuntu's Python doesn't scan /usr/local/.../site-packages by default,
# so point it at the copied packages explicitly
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
COPY . /app
WORKDIR /app
# Run uvicorn via the interpreter so we don't rely on copied script shebangs
CMD ["python3.11", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

2. Layer Caching Strategy

Docker caches each layer. Put things that change rarely at the top:

dockerfile
# SLOW TO CHANGE (cached)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 python3-pip

# MEDIUM (cached unless deps change)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# FAST TO CHANGE (rebuilt often)
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

3. Don't Bake Model Weights into Images

Model files are 4-14GB. Baking them into the image means:

  • 14GB download every time you update code
  • 14GB stored in Docker registry
  • Slow builds, slow deploys

Instead, use volumes:

bash
# Mount model directory at runtime
docker run --gpus all \
    -v /data/models:/app/models \
    -p 8000:8000 \
    my-llm-api:v1
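
One way to populate the host-side model directory in the first place (a sketch, assuming the Hugging Face CLI is installed and your account has access to the model). Mounting the directory read-only is a good habit so the container can't corrupt the weights:

bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /data/models/llama3

# Then mount read-only (:ro)
docker run --gpus all -p 8000:8000 \
    -v /data/models/llama3:/app/models:ro \
    my-llm-api:v1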

4. .dockerignore

Exclude anything the image doesn't need; it keeps the build context (and your layers) small:

code
__pycache__/
*.pyc
.git/
.env
models/
*.bin
*.safetensors
node_modules/
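
To confirm the ignore file is working, check the build-context size that Docker reports; it should be kilobytes, not the gigabytes sitting in models/ (exact output format varies by Docker version):

bash
docker build -t my-llm-api:v1 . 2>&1 | grep -i "transferring context"
# e.g. "=> => transferring context: 41.2kB"   <- weights and .git were excluded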

Docker Compose for AI Stacks

Real AI deployments have multiple services. Docker Compose orchestrates them:

yaml
# docker-compose.yml (the top-level `version` key is obsolete in Compose v2)

services:
  llm-api:
    build: ./api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL_PATH=/app/models/llama3-8b-q4
      - MAX_TOKENS=2048

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - llm-api
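
The nginx service above mounts a ./nginx.conf from the host. A minimal reverse-proxy config that forwards everything to the API might look like this (a sketch; service names like llm-api resolve via Compose's internal DNS):

bash
cat > nginx.conf <<'EOF'
events {}
http {
  server {
    listen 80;
    location / {
      proxy_pass http://llm-api:8000;   # Compose service name, internal DNS
    }
  }
}
EOF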

Run the entire stack:

bash
docker compose up -d        # Start all services
docker compose logs -f      # Watch logs
docker compose down         # Stop everything
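
A few more day-to-day Compose commands worth knowing:

bash
docker compose ps                  # service status and health
docker compose restart llm-api     # bounce a single service
docker compose up -d --build       # rebuild images and redeploy after code changes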

GPU Access in Docker

NVIDIA Container Toolkit

bash
# Install on Ubuntu: add NVIDIA's apt repo and signing key, then install
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access
docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

GPU Flags

bash
--gpus all          # All GPUs
--gpus '"device=0"' # Specific GPU by index
--gpus '"device=GPU-abc123"'  # Specific GPU by UUID

Practice Lab

Task 1: Containerize a Simple AI API
Create a Dockerfile for a FastAPI app that loads a small model (e.g., distilbert-base-uncased for sentiment analysis). Build and run it locally. Test the /predict endpoint with curl.

Task 2: Docker Compose Stack
Create a docker-compose.yml with your AI API + Redis for caching + Nginx as reverse proxy. Run all three services together.

Task 3: Optimization Challenge
Take your Dockerfile from Task 1 and optimize it: use multi-stage build, proper layer ordering, and .dockerignore. Compare the image size before and after.

Pakistan Case Study

Meet Fahad — an ML engineer at a Lahore startup building an Urdu NLP API.

His problem: The API worked perfectly on his RTX 3080 laptop. Deploying to their Hetzner VPS was a 2-day nightmare every time — CUDA version mismatches, Python conflicts, missing system libraries.

His Docker solution:

  • Created a single Dockerfile with pinned CUDA + Python + PyTorch versions
  • Model weights stored on a persistent volume (not in the image)
  • docker-compose.yml with API + Redis cache + Nginx SSL

Deployment now:

bash
git pull && docker compose up -d --build
# Done. 3 minutes.

Results:

  • Deployment time: 2 days → 3 minutes
  • "Broken on server" incidents: 2-3/month → zero
  • New team member onboarding: 1 day of setup → docker compose up and they're running
  • His boss gave him a PKR 20,000 raise for "making deploys boring" (the best kind of boring)

Key Takeaways

  • Docker eliminates "works on my machine" — your AI runs identically everywhere
  • Use NVIDIA base images + Container Toolkit for GPU access in containers
  • Never bake model weights into images — mount them as volumes at runtime
  • Multi-stage builds can cut image sizes dramatically (often 50-70%)
  • Docker Compose orchestrates multi-service AI stacks (API + cache + proxy)
  • Layer ordering matters: put rarely-changing layers first for faster builds

Next lesson: Kubernetes basics for scaling AI deployments across multiple machines.

Lesson Summary

Includes a hands-on practice lab, runnable code examples, and a 4-question knowledge check below.

Quiz: Docker for AI — Containerizing Models & APIs

4 questions to test your understanding. Score 60% or higher to pass.