Module 6: AI Infrastructure & Local LLMs

6.2 Kubernetes Basics for AI Deployment


Docker runs containers on a single machine. But what happens when your AI API gets 10,000 requests per minute? One machine isn't enough. Kubernetes (K8s) orchestrates containers across multiple machines, automatically scaling up when demand spikes and scaling down when it's quiet. This lesson teaches you enough Kubernetes to deploy and scale AI workloads in production.

Why Kubernetes for AI?

The Scaling Problem

code
Single Docker container:
├── Handles ~50-200 requests/sec (depending on model)
├── If it crashes → API is down
├── If traffic spikes → requests queue and time out
└── If you need to update → downtime during restart

Kubernetes:
├── Runs 5 copies (replicas) of your container
├── If one crashes → 4 others keep serving
├── Traffic spike → auto-scale to 20 replicas
└── Update → rolling deployment (zero downtime)

When to Use Kubernetes vs. Plain Docker

Scenario                    | Just Docker    | Use Kubernetes
Personal project / dev      | Yes            | Overkill
Small API (<100 users)      | Yes            | Not needed
Production API (100+ users) | Risky          | Yes
Multi-model serving         | Complicated    | Built for it
Auto-scaling needs          | Manual         | Automatic
High availability required  | Custom scripts | Built-in

Kubernetes Core Concepts

The Architecture

code
┌────────────────────────────────────────────────────┐
│  KUBERNETES CLUSTER                                │
│                                                    │
│  Control Plane (brain)                             │
│  ├── API Server — receives your commands           │
│  ├── Scheduler — decides which node runs what      │
│  ├── Controller Manager — maintains desired state  │
│  └── etcd — cluster state database                 │
│                                                    │
│  Worker Nodes (muscle)                             │
│  ├── Node 1: [Pod A] [Pod B] [Pod C]               │
│  ├── Node 2: [Pod D] [Pod E] [Pod F]  ← GPUs here  │
│  └── Node 3: [Pod G] [Pod H]                       │
└────────────────────────────────────────────────────┘

Key Resources

Resource         | What It Does                            | AI Example
Pod              | Smallest unit — runs your container(s)  | One instance of your LLM API
Deployment       | Manages replica pods                    | "Keep 3 copies of my API running"
Service          | Stable network endpoint                 | Load balancer across all pods
Ingress          | External access + SSL                   | HTTPS endpoint for clients
ConfigMap        | Configuration data                      | Model name, max tokens, temperature
Secret           | Sensitive data                          | API keys, database passwords
PersistentVolume | Storage that survives pod restarts      | Model weight files
HPA              | Horizontal Pod Autoscaler               | Scale from 2→10 pods when CPU > 70%
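
The deployment example later in this lesson hard-codes its environment variables; in practice you would pull them from a ConfigMap and a Secret. A minimal sketch (the names and values here are illustrative, not from the lesson):

yaml
# config.yaml: illustrative ConfigMap and Secret for the llm-api example
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-api-config
data:
  MODEL_PATH: "/models/llama3-8b-q4"
  MAX_TOKENS: "2048"
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-secrets
type: Opaque
stringData:
  API_KEY: "replace-me"   # placeholder; never commit real keys to git

A container then loads both via envFrom (configMapRef / secretRef), so configuration changes don't require rebuilding the image.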

Setting Up a Local Kubernetes Cluster

Option 1: Minikube (Learning)

bash
# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Start cluster with GPU support
minikube start --driver=docker --gpus all

# Verify
kubectl get nodes
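
If you started the cluster with --gpus all, it's worth confirming the node actually advertises the GPU as a schedulable resource. A quick check (assumes the default node name "minikube" and that the NVIDIA device plugin came up):

bash
# Capacity/Allocatable should list nvidia.com/gpu: 1 (or more)
kubectl describe node minikube | grep -i "nvidia.com/gpu"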

Option 2: K3s (Lightweight Production)

K3s is Kubernetes stripped down to essentials — perfect for VPS deployments:

bash
# Install K3s (single command)
curl -sfL https://get.k3s.io | sh -

# Verify
sudo k3s kubectl get nodes
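
K3s writes its kubeconfig to /etc/rancher/k3s/k3s.yaml, so you can use plain kubectl instead of the k3s wrapper. A quick setup sketch (paths are K3s defaults):

bash
# Copy the K3s kubeconfig into the standard location for your user
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown "$USER" ~/.kube/config

# Now plain kubectl works
kubectl get nodes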

Option 3: Managed Kubernetes (Production)

Provider     | Service                    | GPU Nodes | Starting Cost
GKE (Google) | Google Kubernetes Engine   | T4, A100  | ~$150/month
EKS (AWS)    | Elastic Kubernetes Service | T4, A10G  | ~$170/month
AKS (Azure)  | Azure Kubernetes Service   | T4, A100  | ~$160/month
Vultr        | Vultr Kubernetes           | A100      | ~$140/month

Deploying Your AI API to Kubernetes

Step 1: Create a Deployment

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
  labels:
    app: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-api
          image: your-registry/llm-api:v1
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          env:
            - name: MODEL_PATH
              value: "/models/llama3-8b-q4"
            - name: MAX_TOKENS
              value: "2048"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
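
The deployment mounts a PersistentVolumeClaim called model-pvc, which must exist before any pod can start. A minimal sketch, assuming your cluster has a default StorageClass:

yaml
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce   # replicas spread across nodes need ReadWriteMany storage (e.g. NFS)
  resources:
    requests:
      storage: 20Gi   # comfortably fits a 4-bit quantized 8B model (~5 GB)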

Step 2: Create a Service

yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-api-service
spec:
  selector:
    app: llm-api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
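
A ClusterIP service is only reachable inside the cluster. Before wiring up the Ingress, you can smoke-test it through a port-forward:

bash
# Forward local port 8080 to the service's port 80
kubectl port-forward svc/llm-api-service 8080:80

# In another terminal, hit the health endpoint the readinessProbe uses
curl http://localhost:8080/health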

Step 3: Create an Ingress

yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  tls:
    - hosts:
        - api.yoursite.com
      secretName: api-tls
  rules:
    - host: api.yoursite.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-api-service
                port:
                  number: 80
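
The cert-manager.io/cluster-issuer annotation assumes cert-manager is installed and that a ClusterIssuer named letsencrypt exists. A minimal HTTP-01 issuer sketch (the email and ingress class are placeholders for your own values):

yaml
# cluster-issuer.yaml (requires cert-manager to be installed first)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com          # placeholder: use a real contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx            # placeholder: match your ingress controller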

Step 4: Apply Everything

bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml

# Watch pods come up
kubectl get pods -w

# Check logs
kubectl logs -f deployment/llm-api

Auto-Scaling AI Workloads

Horizontal Pod Autoscaler (HPA)

yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This says: "Keep at least 2 pods. If average CPU goes above 70%, add pods. Maximum 10 pods."
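
Note that the HPA reads CPU figures from the Kubernetes metrics API, so it silently does nothing unless metrics-server is running:

bash
# On minikube, metrics-server ships as an addon
minikube addons enable metrics-server

# Watch the autoscaler react; TARGETS shows current vs. target utilization
kubectl get hpa llm-api-hpa --watch
kubectl top pods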

Essential kubectl Commands

bash
# Cluster info
kubectl get nodes                    # List worker nodes
kubectl get pods                     # List running pods
kubectl get services                 # List services

# Deployment management
kubectl apply -f deployment.yaml     # Create/update resources
kubectl rollout status deployment/llm-api  # Watch deployment
kubectl rollout undo deployment/llm-api    # Rollback

# Debugging
kubectl describe pod <pod-name>      # Detailed pod info
kubectl logs <pod-name>              # Container logs
kubectl exec -it <pod-name> -- bash  # Shell into container

# Scaling
kubectl scale deployment llm-api --replicas=5  # Manual scale
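
A rolling update is just a new image tag on the Deployment: Kubernetes replaces pods a few at a time, so the Service always has healthy backends. A sketch (the v2 tag is illustrative):

bash
# Point the container at the new image, then watch the rollout
kubectl set image deployment/llm-api llm-api=your-registry/llm-api:v2
kubectl rollout status deployment/llm-api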

Practice Lab

Task 1: Local Cluster. Install minikube or K3s. Deploy a simple web server (nginx) with 3 replicas. Verify all pods are running and the service is accessible.

Task 2: AI Deployment. Create Kubernetes YAML files for your containerized AI API from Lesson 6.1. Deploy it with 2 replicas, a Service, and health checks.

Task 3: Auto-Scaling Test. Set up an HPA on your deployment. Use a load-testing tool (hey, wrk, or k6) to generate traffic and watch the pods auto-scale (see the sketch below).
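
For Task 3, a minimal load-test sketch using hey (the URL and duration are illustrative; point it at your own Service or Ingress):

bash
# 50 concurrent workers for 2 minutes
hey -z 2m -c 50 http://api.yoursite.com/health

# In another terminal, watch replicas scale up and back down
kubectl get hpa llm-api-hpa --watch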

Pakistan Case Study

Meet Zain — DevOps engineer at an Islamabad AI startup serving Urdu OCR as an API.

His problem: a single Docker container on a Hetzner VPS handled 50 req/sec. A corporate client needed 500 req/sec with a 99.9% uptime SLA.

His K3s solution:

  • 3 Hetzner servers (each PKR 8,000/month = PKR 24,000 total)
  • K3s cluster with GPU node for inference
  • HPA: 2-8 replicas based on request queue depth
  • Rolling deployments for zero-downtime updates

Results:

  • Capacity: 50 → 600 req/sec
  • Uptime: 99.95% (exceeded SLA)
  • Deployment time: 20 min manual → 2 min kubectl apply
  • Won the enterprise contract: PKR 300,000/month recurring

Key Takeaways

  • Kubernetes orchestrates multiple containers across machines for scaling and reliability
  • Key resources: Pod (runs container), Deployment (manages replicas), Service (load balancer), Ingress (external access)
  • K3s is the lightweight choice for VPS — full K8s is for managed cloud
  • HPA auto-scales pods based on CPU/memory/custom metrics
  • Always set resource requests/limits — prevents one pod from starving others
  • Health checks (readinessProbe) prevent traffic routing to unhealthy pods

Next lesson: GPU scheduling and resource management in Kubernetes clusters.

Lesson Summary

Includes a hands-on practice lab, 10 runnable code examples, and a 4-question knowledge check below.

Quiz: Kubernetes Basics for AI Deployment

4 questions to test your understanding. Score 60% or higher to pass.