Module 6: AI Infrastructure & Local LLMs

6.2 Kubernetes Basics for AI Deployment


Docker runs containers on a single machine. But what happens when your AI API gets 10,000 requests per minute? One machine isn't enough. Kubernetes (K8s) orchestrates containers across multiple machines, automatically scaling up when demand spikes and scaling down when it's quiet. This lesson teaches you enough Kubernetes to deploy and scale AI workloads in production.

Why Kubernetes for AI?

The Scaling Problem

code
Single Docker container:
├── Handles ~50-200 requests/sec (depending on model)
├── If it crashes → API is down
├── If traffic spikes → requests queue and time out
└── If you need to update → downtime during restart

Kubernetes:
├── Runs 5 copies (replicas) of your container
├── If one crashes → 4 others keep serving
├── Traffic spike → auto-scale to 20 replicas
└── Update → rolling deployment (zero downtime)

When to Use Kubernetes vs. Plain Docker

Scenario                    | Just Docker    | Use Kubernetes
Personal project / dev      | Yes            | Overkill
Small API (<100 users)      | Yes            | Not needed
Production API (100+ users) | Risky          | Yes
Multi-model serving         | Complicated    | Built for it
Auto-scaling needs          | Manual         | Automatic
High availability required  | Custom scripts | Built-in

Kubernetes Core Concepts

The Architecture

code
┌────────────────────────────────────────────────────┐
│  KUBERNETES CLUSTER                                │
│                                                    │
│  Control Plane (brain)                             │
│  ├── API Server — receives your commands           │
│  ├── Scheduler — decides which node runs what      │
│  ├── Controller Manager — maintains desired state  │
│  └── etcd — cluster state database                 │
│                                                    │
│  Worker Nodes (muscle)                             │
│  ├── Node 1: [Pod A] [Pod B] [Pod C]               │
│  ├── Node 2: [Pod D] [Pod E] [Pod F]  ← GPUs here  │
│  └── Node 3: [Pod G] [Pod H]                       │
└────────────────────────────────────────────────────┘

Key Resources

Resource         | What It Does                            | AI Example
Pod              | Smallest unit — runs your container(s)  | One instance of your LLM API
Deployment       | Manages replica pods                    | "Keep 3 copies of my API running"
Service          | Stable network endpoint                 | Load balancer across all pods
Ingress          | External access + SSL                   | HTTPS endpoint for clients
ConfigMap        | Configuration data                      | Model name, max tokens, temperature
Secret           | Sensitive data                          | API keys, database passwords
PersistentVolume | Storage that survives pod restarts      | Model weight files
HPA              | Horizontal Pod Autoscaler               | Scale from 2→10 pods when CPU > 70%
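
The deployment example later in this lesson hard-codes its environment variables; in practice you would pull them from a ConfigMap and a Secret. A minimal sketch (the names and values here are illustrative, not from the lesson):

yaml
# config.yaml: illustrative ConfigMap and Secret for the llm-api example
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-api-config
data:
  MODEL_PATH: "/models/llama3-8b-q4"
  MAX_TOKENS: "2048"
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-secrets
type: Opaque
stringData:
  API_KEY: "replace-me"   # placeholder; never commit real keys to git

A container then loads both via envFrom (configMapRef / secretRef), so configuration changes don't require rebuilding the image.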

Setting Up a Local Kubernetes Cluster

Option 1: Minikube (Learning)

bash
# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Start cluster with GPU support
minikube start --driver=docker --gpus all

# Verify
kubectl get nodes
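
If you started the cluster with --gpus all, it's worth confirming the node actually advertises the GPU as a schedulable resource. A quick check (assumes the default node name "minikube" and that the NVIDIA device plugin came up):

bash
# Capacity/Allocatable should list nvidia.com/gpu: 1 (or more)
kubectl describe node minikube | grep -i "nvidia.com/gpu"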

Option 2: K3s (Lightweight Production)

K3s is Kubernetes stripped down to essentials — perfect for VPS deployments:

bash
# Install K3s (single command)
curl -sfL https://get.k3s.io | sh -

# Verify
sudo k3s kubectl get nodes
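
K3s writes its kubeconfig to /etc/rancher/k3s/k3s.yaml, so you can use plain kubectl instead of the k3s wrapper. A quick setup sketch (paths are K3s defaults):

bash
# Copy the K3s kubeconfig into the standard location for your user
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown "$USER" ~/.kube/config

# Now plain kubectl works
kubectl get nodes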

Option 3: Managed Kubernetes (Production)

Provider     | Service                    | GPU Nodes | Starting Cost
GKE (Google) | Google Kubernetes Engine   | T4, A100  | ~$150/month
EKS (AWS)    | Elastic Kubernetes Service | T4, A10G  | ~$170/month
AKS (Azure)  | Azure Kubernetes Service   | T4, A100  | ~$160/month
Vultr        | Vultr Kubernetes           | A100      | ~$140/month

Deploying Your AI API to Kubernetes

Step 1: Create a Deployment

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
  labels:
    app: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-api
          image: your-registry/llm-api:v1
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          env:
            - name: MODEL_PATH
              value: "/models/llama3-8b-q4"
            - name: MAX_TOKENS
              value: "2048"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
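
The deployment mounts a PersistentVolumeClaim called model-pvc, which must exist before any pod can start. A minimal sketch, assuming your cluster has a default StorageClass:

yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce   # replicas spread across nodes need ReadWriteMany storage (e.g. NFS)
  resources:
    requests:
      storage: 20Gi   # comfortably fits a 4-bit quantized 8B model (~5 GB)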

Step 2: Create a Service

yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-api-service
spec:
  selector:
    app: llm-api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
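
A ClusterIP service is only reachable inside the cluster. Before wiring up the Ingress, you can smoke-test it through a port-forward:

bash
# Forward local port 8080 to the service's port 80
kubectl port-forward svc/llm-api-service 8080:80

# In another terminal, hit the health endpoint the readinessProbe uses
curl http://localhost:8080/health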

Step 3: Create an Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  tls:
    - hosts:
        - api.yoursite.com
      secretName: api-tls
  rules:
    - host: api.yoursite.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-api-service
                port:
                  number: 80
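
The cert-manager.io/cluster-issuer annotation assumes cert-manager is installed and that a ClusterIssuer named letsencrypt exists. A minimal HTTP-01 issuer sketch (the email and ingress class are placeholders for your own values):

yaml
# cluster-issuer.yaml (requires cert-manager to be installed first)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com          # placeholder: use a real contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx            # placeholder: match your ingress controller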

Step 4: Apply Everything

bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml

# Watch pods come up
kubectl get pods -w

# Check logs
kubectl logs -f deployment/llm-api

Auto-Scaling AI Workloads

Horizontal Pod Autoscaler (HPA)

yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This says: "Keep at least 2 pods. If average CPU goes above 70%, add pods. Maximum 10 pods."
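
Note that the HPA reads CPU figures from the Kubernetes metrics API, so it silently does nothing unless metrics-server is running:

bash
# On minikube, metrics-server ships as an addon
minikube addons enable metrics-server

# Watch the autoscaler react; TARGETS shows current vs. target utilization
kubectl get hpa llm-api-hpa --watch
kubectl top pods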

Essential kubectl Commands

bash
# Cluster info
kubectl get nodes                    # List worker nodes
kubectl get pods                     # List running pods
kubectl get services                 # List services

# Deployment management
kubectl apply -f deployment.yaml     # Create/update resources
kubectl rollout status deployment/llm-api  # Watch deployment
kubectl rollout undo deployment/llm-api    # Rollback

# Debugging
kubectl describe pod <pod-name>      # Detailed pod info
kubectl logs <pod-name>              # Container logs
kubectl exec -it <pod-name> -- bash  # Shell into container

# Scaling
kubectl scale deployment llm-api --replicas=5  # Manual scale
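
A rolling update is just a new image tag on the Deployment: Kubernetes replaces pods a few at a time, so the Service always has healthy backends. A sketch (the v2 tag is illustrative):

bash
# Point the container at the new image, then watch the rollout
kubectl set image deployment/llm-api llm-api=your-registry/llm-api:v2
kubectl rollout status deployment/llm-api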

Practice Lab

Task 1: Local Cluster. Install minikube or K3s. Deploy a simple web server (nginx) with 3 replicas. Verify all pods are running and the service is accessible.

Task 2: AI Deployment. Create Kubernetes YAML files for your containerized AI API from Lesson 6.1. Deploy it with 2 replicas, a Service, and health checks.

Task 3: Auto-Scaling Test. Set up an HPA on your deployment. Use a load-testing tool (hey, wrk, or k6) to generate traffic and watch the pods auto-scale (see the sketch below).
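
For Task 3, a minimal load-test sketch using hey (the URL and duration are illustrative; point it at your own Service or Ingress):

bash
# 50 concurrent workers for 2 minutes
hey -z 2m -c 50 http://api.yoursite.com/health

# In another terminal, watch replicas scale up and back down
kubectl get hpa llm-api-hpa --watch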

Pakistan Case Study

Meet Zain — DevOps engineer at an Islamabad AI startup serving Urdu OCR as an API.

His problem: a single Docker container on a Hetzner VPS handled 50 req/sec. A corporate client needed 500 req/sec with a 99.9% uptime SLA.

His K3s solution:

  • 3 Hetzner servers (each PKR 8,000/month = PKR 24,000 total)
  • K3s cluster with GPU node for inference
  • HPA: 2-8 replicas based on request queue depth
  • Rolling deployments for zero-downtime updates

Results:

  • Capacity: 50 → 600 req/sec
  • Uptime: 99.95% (exceeded SLA)
  • Deployment time: 20 min manual → 2 min kubectl apply
  • Won the enterprise contract: PKR 300,000/month recurring

Key Takeaways

  • Kubernetes orchestrates multiple containers across machines for scaling and reliability
  • Key resources: Pod (runs container), Deployment (manages replicas), Service (load balancer), Ingress (external access)
  • K3s is the lightweight choice for VPS — full K8s is for managed cloud
  • HPA auto-scales pods based on CPU/memory/custom metrics
  • Always set resource requests/limits — prevents one pod from starving others
  • Health checks (readinessProbe) prevent traffic routing to unhealthy pods

Next lesson: GPU scheduling and resource management in Kubernetes clusters.

Lesson Summary

Includes a hands-on practice lab, 10 runnable code examples, and a 4-question knowledge check below.

Quiz: Kubernetes Basics for AI Deployment

4 questions to test your understanding. Score 60% or higher to pass.