6.2 — Kubernetes Basics for AI Deployment
With plain Docker, your API runs as a container on a single machine. But what happens when your AI API gets 10,000 requests per minute? One machine isn't enough. Kubernetes (K8s) orchestrates containers across multiple machines, automatically scaling up when demand spikes and scaling down when it's quiet. This lesson teaches you enough Kubernetes to deploy and scale AI workloads in production.
Why Kubernetes for AI?
The Scaling Problem
Single Docker container:
├── Handles ~50-200 requests/sec (depending on model)
├── If it crashes → API is down
├── If traffic spikes → requests queue and timeout
└── If you need to update → downtime during restart
Kubernetes:
├── Runs 5 copies (replicas) of your container
├── If one crashes → 4 others keep serving
├── Traffic spike → auto-scale to 20 replicas in seconds
└── Update → rolling deployment (zero downtime)
When to Use Kubernetes vs. Plain Docker
| Scenario | Just Docker | Use Kubernetes |
|---|---|---|
| Personal project / dev | Yes | Overkill |
| Small API (<100 users) | Yes | Not needed |
| Production API (100+ users) | Risky | Yes |
| Multi-model serving | Complicated | Built for it |
| Auto-scaling needs | Manual | Automatic |
| High availability required | Custom scripts | Built-in |
Kubernetes Core Concepts
The Architecture
┌─────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
│ │
│ Control Plane (brain) │
│ ├── API Server — receives your commands │
│ ├── Scheduler — decides which node runs what │
│ ├── Controller Manager — maintains desired state │
│ └── etcd — cluster state database │
│ │
│ Worker Nodes (muscle) │
│ ├── Node 1: [Pod A] [Pod B] [Pod C] │
│ ├── Node 2: [Pod D] [Pod E] [Pod F] ← GPUs here │
│ └── Node 3: [Pod G] [Pod H] │
└─────────────────────────────────────────────────────┘
Key Resources
| Resource | What It Does | AI Example |
|---|---|---|
| Pod | Smallest unit — runs your container(s) | One instance of your LLM API |
| Deployment | Manages replica pods | "Keep 3 copies of my API running" |
| Service | Stable network endpoint | Load balancer across all pods |
| Ingress | External access + SSL | HTTPS endpoint for clients |
| ConfigMap | Configuration data | Model name, max tokens, temperature |
| Secret | Sensitive data | API keys, database passwords |
| PersistentVolume | Storage that survives pod restarts | Model weight files |
| HPA | Horizontal Pod Autoscaler | Scale from 2→10 pods when CPU > 70% |
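To make ConfigMap and Secret concrete, here is a minimal sketch; the resource names and values are illustrative and not part of the deployment built later in this lesson. A Deployment can pull these into environment variables with envFrom or valueFrom.
# config.yaml (illustrative names and values)
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-api-config
data:
  MODEL_NAME: "llama3-8b-q4"
  MAX_TOKENS: "2048"
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-secrets
type: Opaque
stringData:
  HF_TOKEN: "replace-me"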
Setting Up a Local Kubernetes Cluster
Option 1: Minikube (Learning)
# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
# Start cluster with GPU support
minikube start --driver=docker --gpus all
# Verify
kubectl get nodes
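If you started minikube with --gpus all, it is worth checking that the node actually advertises a GPU before scheduling anything on it. A quick check, assuming the NVIDIA device plugin came up correctly:
# Check that the node reports an allocatable GPU
kubectl get nodes -o jsonpath='{.items[0].status.allocatable}'
# Expect an entry like "nvidia.com/gpu": "1"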
Option 2: K3s (Lightweight Production)
K3s is Kubernetes stripped down to essentials — perfect for VPS deployments:
# Install K3s (single command)
curl -sfL https://get.k3s.io | sh -
# Verify
sudo k3s kubectl get nodes
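One K3s-specific detail: it writes its kubeconfig to /etc/rancher/k3s/k3s.yaml, so plain kubectl (without sudo k3s) works once you copy that file into place:
# Let plain kubectl talk to the K3s cluster
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown "$USER" ~/.kube/config
kubectl get nodes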
Option 3: Managed Kubernetes (Production)
| Provider | Service | GPU Nodes | Starting Cost |
|---|---|---|---|
| GKE (Google) | Google Kubernetes Engine | T4, A100 | ~$150/month |
| EKS (AWS) | Elastic Kubernetes Service | T4, A10G | ~$170/month |
| AKS (Azure) | Azure Kubernetes Service | T4, A100 | ~$160/month |
| Vultr | Vultr Kubernetes | A100 | ~$140/month |
Deploying Your AI API to Kubernetes
Step 1: Create a Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
  labels:
    app: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
      - name: llm-api
        image: your-registry/llm-api:v1
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_PATH
          value: "/models/llama3-8b-q4"
        - name: MAX_TOKENS
          value: "2048"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
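The Deployment mounts a PersistentVolumeClaim named model-pvc that you need to create separately. A sketch is below; the size and especially the access mode are assumptions. With several replicas spread across nodes, the volume must support ReadWriteMany (for example an NFS-backed storage class); otherwise bake the weights into the image or download them at startup.
# pvc.yaml (size and access mode are assumptions; adjust to your storage class)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi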
Step 2: Create a Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-api-service
spec:
  selector:
    app: llm-api
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
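ClusterIP means the Service is only reachable inside the cluster. For a quick test before wiring up the Ingress, port-forward it to your machine; the /health path matches the readinessProbe defined above:
# Forward local port 8080 to the Service
kubectl port-forward service/llm-api-service 8080:80
# In another terminal:
curl http://localhost:8080/health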
Step 3: Create an Ingress
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  tls:
  - hosts:
    - api.yoursite.com
    secretName: api-tls
  rules:
  - host: api.yoursite.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-api-service
            port:
              number: 80
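This manifest assumes two things are already installed in the cluster: an ingress controller (Traefik ships with K3s; minikube has an nginx ingress addon) and cert-manager with a ClusterIssuer named letsencrypt. After applying, you can verify both:
kubectl get ingress llm-api-ingress   # ADDRESS column should show an IP
kubectl get certificate api-tls       # READY becomes True once the certificate is issued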
Step 4: Apply Everything
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
# Watch pods come up
kubectl get pods -w
# Check logs
kubectl logs -f deployment/llm-api
Auto-Scaling AI Workloads
Horizontal Pod Autoscaler (HPA)
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
This says: "Keep at least 2 pods. If average CPU goes above 70%, add pods. Maximum 10 pods."
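Note that CPU-based scaling needs the metrics-server component (K3s ships it; on minikube run minikube addons enable metrics-server). To see the autoscaler react, watch it in one terminal while a load generator such as hey hits the API in another. The endpoint and payload below are placeholders for your own API:
# Terminal 1: watch replica count change
kubectl get hpa llm-api-hpa --watch
# Terminal 2: 2 minutes of load at 50 concurrent requests (placeholder endpoint)
hey -z 2m -c 50 -m POST -d '{"prompt": "hello"}' https://api.yoursite.com/generate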
Essential kubectl Commands
# Cluster info
kubectl get nodes # List worker nodes
kubectl get pods # List running pods
kubectl get services # List services
# Deployment management
kubectl apply -f deployment.yaml # Create/update resources
kubectl rollout status deployment/llm-api # Watch deployment
kubectl rollout undo deployment/llm-api # Rollback
# Debugging
kubectl describe pod <pod-name> # Detailed pod info
kubectl logs <pod-name> # Container logs
kubectl exec -it <pod-name> -- bash # Shell into container
# Scaling
kubectl scale deployment llm-api --replicas=5 # Manual scale
Practice Lab
Task 1 (Local Cluster): Install minikube or K3s. Deploy a simple web server (nginx) with 3 replicas. Verify all pods are running and the service is accessible (a starting point is sketched below).
Task 2 (AI Deployment): Create Kubernetes YAML files for your containerized AI API from Lesson 6.1. Deploy it with 2 replicas, a Service, and health checks.
Task 3 (Auto-Scaling Test): Set up an HPA on your deployment. Use a load testing tool (hey, wrk, or k6) to generate traffic and watch pods auto-scale.
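For Task 1, a minimal starting point using imperative commands (nginx and the deployment name web are just examples):
kubectl create deployment web --image=nginx --replicas=3
kubectl expose deployment web --port=80
kubectl get pods -l app=web                # all three pods should reach Running
kubectl port-forward service/web 8080:80   # then curl http://localhost:8080 in another terminal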
Pakistan Case Study
Meet Zain — DevOps engineer at an Islamabad AI startup serving Urdu OCR as an API.
His problem: Single Docker container on a Hetzner VPS handled 50 req/sec. A corporate client needed 500 req/sec with 99.9% uptime SLA.
His K3s solution:
- 3 Hetzner servers (each PKR 8,000/month = PKR 24,000 total)
- K3s cluster with GPU node for inference
- HPA: 2-8 replicas based on request queue depth (a sketch of this kind of autoscaler follows the case study)
- Rolling deployments for zero-downtime updates
Results:
- Capacity: 50 → 600 req/sec
- Uptime: 99.95% (exceeded SLA)
- Deployment time: 20 min manual → 2 min with kubectl apply
- Won the enterprise contract: PKR 300,000/month recurring
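Zain's queue-depth scaling goes beyond the CPU-based HPA shown earlier: it needs a custom metric exposed through the custom metrics API, typically via Prometheus and the Prometheus Adapter. A hypothetical sketch of such an HPA, with the metric name and target value invented for illustration:
# hpa-queue.yaml (hypothetical; assumes a per-pod metric "inference_queue_depth"
# is exposed through the custom metrics API, e.g. via Prometheus Adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ocr-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"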
Key Takeaways
- Kubernetes orchestrates multiple containers across machines for scaling and reliability
- Key resources: Pod (runs container), Deployment (manages replicas), Service (load balancer), Ingress (external access)
- K3s is the lightweight choice for VPS — full K8s is for managed cloud
- HPA auto-scales pods based on CPU/memory/custom metrics
- Always set resource requests/limits — prevents one pod from starving others
- Health checks (readinessProbe) prevent traffic routing to unhealthy pods
Next lesson: GPU scheduling and resource management in Kubernetes clusters.