Module 7: AI Infrastructure & Local LLMs

7.2 Load Balancing & Auto-Scaling AI Services


Your FastAPI app serves one model on one GPU. Real production means handling traffic spikes, surviving server crashes, and distributing load across multiple instances. This lesson teaches you to put a load balancer in front of your AI APIs and configure auto-scaling that reacts to demand, so your infrastructure grows and shrinks with your traffic.

Why Load Balancing Matters for AI

AI inference is computationally expensive. A single GPU can handle maybe 20-50 concurrent requests for a 7B model. Without load balancing:

code
100 concurrent users → 1 server → 50 timeout, 50 served (50% error rate)

100 concurrent users → load balancer → 3 servers → all served (0% error rate)

Load Balancing Strategies

Strategy Comparison for AI

Strategy | How It Works | Best For AI?
Round Robin | Rotate requests across servers | Bad — ignores GPU load
Least Connections | Send to server with fewest active requests | Good — respects capacity
Weighted | Send more to powerful servers | Great — A100 gets 3x traffic vs T4
IP Hash | Same client → same server | Good for session/context caching
Custom (GPU-aware) | Route based on GPU utilization | Best — purpose-built for AI
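There is no off-the-shelf Nginx module for the GPU-aware row, but the idea is easy to sketch. Below is a minimal illustration in Python, assuming each backend exposes a hypothetical /gpu-stats endpoint that returns {"utilization": 0-100}; a real router would also need streaming support, retries, and caching of the stats.

python
# gpu_router.py: minimal sketch of GPU-aware routing (not production code).
# Assumes each backend serves GET /gpu-stats -> {"utilization": 0-100};
# that endpoint is hypothetical; adapt it to the metrics you actually expose.
import asyncio

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import Response

BACKENDS = ["http://10.0.1.10:8000", "http://10.0.1.11:8000", "http://10.0.1.12:8000"]

app = FastAPI()
client = httpx.AsyncClient(timeout=120)

async def gpu_utilization(base_url: str) -> float:
    """Poll one backend's GPU utilization; unreachable servers sort last."""
    try:
        r = await client.get(f"{base_url}/gpu-stats", timeout=2)
        return float(r.json()["utilization"])
    except (httpx.HTTPError, KeyError, ValueError):
        return float("inf")

@app.post("/v1/{path:path}")
async def route(path: str, request: Request) -> Response:
    # Pick the backend with the least-loaded GPU, then forward the request
    utils = await asyncio.gather(*(gpu_utilization(b) for b in BACKENDS))
    target = BACKENDS[utils.index(min(utils))]
    upstream = await client.post(
        f"{target}/v1/{path}",
        content=await request.body(),
        headers={"content-type": request.headers.get("content-type", "application/json")},
    )
    return Response(content=upstream.content, status_code=upstream.status_code,
                    media_type=upstream.headers.get("content-type"))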

Nginx Load Balancer Configuration

nginx
# /etc/nginx/nginx.conf
upstream llm_backends {
    least_conn;    # Route to server with fewest connections

    server 10.0.1.10:8000 weight=3;  # A100 server (3x capacity)
    server 10.0.1.11:8000 weight=1;  # T4 server (baseline)
    server 10.0.1.12:8000 weight=1;  # T4 server (baseline)
}

server {
    listen 80;
    server_name api.yoursite.com;

    # Longer timeouts for AI inference
    proxy_read_timeout 120s;
    proxy_connect_timeout 10s;
    proxy_send_timeout 30s;

    location /v1/ {
        proxy_pass http://llm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location /health {
        proxy_pass http://llm_backends;
    }
}
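To sanity-check the setup, hammer the proxy with concurrent requests and count successes. A quick sketch (the URL and request body are placeholders for your own API):

python
# load_test.py: fire N concurrent requests at the load balancer.
import asyncio
import time

import httpx

URL = "http://api.yoursite.com/v1/completions"   # your LB endpoint (placeholder)
CONCURRENCY = 50

async def one_request(client: httpx.AsyncClient) -> int:
    r = await client.post(URL, json={"prompt": "ping", "max_tokens": 1})
    return r.status_code

async def main():
    async with httpx.AsyncClient(timeout=120) as client:
        start = time.perf_counter()
        codes = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
        elapsed = time.perf_counter() - start
    ok = sum(code == 200 for code in codes)
    print(f"{ok}/{CONCURRENCY} requests succeeded in {elapsed:.1f}s")

asyncio.run(main())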

Health Checks

Open-source Nginx uses passive health checks: failed proxied requests count against a backend, rather than separate probe requests (active probing requires NGINX Plus):

nginx
upstream llm_backends {
    least_conn;
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    # If 3 requests to a server fail within 30s, it is taken
    # out of rotation for the next 30s
}
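Since max_fails counts failed proxied requests, backends should answer /health with a non-2xx status while the model is still loading. A minimal FastAPI sketch (the background-loading pattern is an assumption; wire it to however your app actually loads weights):

python
# health.py: /health returns 503 until the model is loaded, so Nginx's
# passive checks (and later, Kubernetes readiness probes) treat the
# instance as unhealthy during startup.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
app.state.model_loaded = False

async def load_model():
    # Stand-in for real loading (30-120s), e.g. await asyncio.to_thread(load_weights)
    await asyncio.sleep(60)
    app.state.model_loaded = True

@app.on_event("startup")
async def schedule_load():
    asyncio.create_task(load_model())   # load in the background; don't block startup

@app.get("/health")
async def health(response: Response):
    if not app.state.model_loaded:
        response.status_code = 503      # the load balancer counts this as a failure
        return {"status": "loading"}
    return {"status": "ok"}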

Auto-Scaling Patterns

Pattern 1: Kubernetes HPA (Horizontal Pod Autoscaler)

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Require 60s of sustained demand before scaling up
      policies:
        - type: Pods
          value: 2                       # Add max 2 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1                       # Remove max 1 pod at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
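    # The Pods metric below is custom: it only works if a metrics adapter
    # (e.g. Prometheus Adapter) is installed in the cluster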
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"

Why asymmetric scaling?

  • Scale UP fast (60s) — don't lose requests
  • Scale DOWN slow (5min) — avoid thrashing (spin up → spin down → spin up)
  • AI model loading takes 30-120s — scaling too fast creates pods that aren't ready

Pattern 2: Queue-Based Scaling

Instead of CPU-based scaling, use request queue depth:

code
Queue depth < 5  → 2 replicas (baseline)
Queue depth 5-20 → 4 replicas
Queue depth > 20 → 8 replicas
Queue depth > 50 → 10 replicas (max)

This is better than CPU-based for AI because:

  • GPU utilization is what matters, not CPU
  • Queue depth directly measures user-facing latency pressure
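For the request_queue_depth metric in the HPA above to exist, each pod has to export it. A sketch using prometheus_client inside the FastAPI app, counting in-flight requests as a proxy for queue depth (the metric name is an assumption and must match your metrics-adapter configuration):

python
# metrics.py: export queue depth so Prometheus (and the HPA, via a
# metrics adapter) can scrape it from each pod at /metrics.
from fastapi import FastAPI, Request
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()
queue_depth = Gauge("request_queue_depth", "In-flight inference requests")

@app.middleware("http")
async def track_queue(request: Request, call_next):
    queue_depth.inc()          # request entered the system
    try:
        return await call_next(request)
    finally:
        queue_depth.dec()      # request finished (or failed)

app.mount("/metrics", make_asgi_app())   # Prometheus scrapes this endpoint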

Pattern 3: Scheduled Scaling

If your traffic is predictable, scale on a schedule instead. Note that CronHPA is not built into Kubernetes; it comes from add-on controllers (for example, kubernetes-cronhpa-controller), and the exact schema varies by controller, so read this manifest as a sketch:

yaml
# Scale up for business hours (Pakistan: 9 AM - 6 PM PKT)
apiVersion: autoscaling/v2
kind: CronHPA
metadata:
  name: business-hours
spec:
  schedules:
    - schedule: "0 9 * * 1-5"   # 9 AM Mon-Fri
      minReplicas: 4
      maxReplicas: 10
    - schedule: "0 18 * * 1-5"  # 6 PM Mon-Fri
      minReplicas: 2
      maxReplicas: 4
    - schedule: "0 0 * * 6-7"   # Weekends
      minReplicas: 1
      maxReplicas: 2

Handling Cold Starts

AI models take 30-120 seconds to load. During scale-up, new pods aren't ready immediately.

Solution: Warm Pool

Keep 1-2 "warm" pods with models pre-loaded but not receiving traffic:

yaml
spec:
  replicas: 3    # 2 active + 1 warm

  # Readiness probe: only route traffic when model is loaded
  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 90   # Model loading time
    periodSeconds: 10
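If load time varies widely, a fixed initialDelaySeconds either wastes time or fires too early. A Kubernetes startupProbe handles this more gracefully; a sketch reusing the same /health endpoint:

yaml
  startupProbe:
    httpGet:
      path: /health
      port: 8000
    periodSeconds: 10
    failureThreshold: 18   # allow up to 180s for the model to load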

Solution: Model Caching

Pre-download model weights to a shared volume:

yaml
spec:
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: shared-models
  containers:
    - name: llm-api
      volumeMounts:
        - name: model-cache
          mountPath: /models   # point your model loader at this path

# All pods mount the same volume — no download needed per pod
# Model loading: 120s (download + load) → 30s (load from cache)

Monitoring & Alerting

Key Metrics to Track

Metric | What It Tells You | Alert Threshold
Request latency (p95) | User experience | > 5s for chat, > 30s for generation
Error rate | Service health | > 1% of requests
Queue depth | Capacity pressure | > 20 requests waiting
GPU utilization | Resource efficiency | < 30% (wasting money) or > 90% (overloaded)
Active replicas | Scaling behavior | Unexpected changes
VRAM usage | Memory pressure | > 90% of available

Alerting with Prometheus + AlertManager

yaml
# alert-rules.yaml
groups:
  - name: llm-api-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LLM API p95 latency above 5 seconds"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM API 5xx error rate above 1% of requests"
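A GPU rule in the same style picks up the thresholds from the metrics table. The metric name assumes NVIDIA's DCGM exporter is running; substitute whatever your exporter emits:

yaml
      - alert: GPUOverloaded
        expr: avg(DCGM_FI_DEV_GPU_UTIL) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization above 90%; consider scaling up"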

Practice Lab

Task 1: Nginx Load Balancer
Set up Nginx as a reverse proxy in front of 2 FastAPI instances (they can run on different ports on the same machine). Configure the least_conn strategy and test with concurrent requests.

Task 2: Auto-Scaling Simulation
Deploy your AI API on Kubernetes with HPA. Use a load testing tool to generate traffic ramps (10 → 50 → 100 → 10 req/sec) and observe pods scaling up and down.

Task 3: Cold Start Optimization
Measure your model's cold start time. Implement the shared volume model cache approach and measure the improvement.

Pakistan Case Study

Meet Nadia — CTO of a Karachi AI startup offering Urdu text summarization as a service.

Her problem: A news aggregator client signed up. Traffic pattern: 500 req/min during morning news hours (7-9 AM), 50 req/min rest of the day. Fixed 4-server setup meant paying for peak capacity 24/7.

Her auto-scaling solution:

  • K3s cluster on 5 Hetzner VPS nodes (GPU-equipped)
  • HPA: min 1 replica, max 6, scale on queue depth
  • Scheduled scaling: pre-warm 3 replicas at 6:45 AM PKT
  • Shared model volume: cold start 90s → 25s

Results:

  • Morning spike handled without errors (auto-scaled to 5 replicas)
  • Off-peak: scaled down to 1 replica
  • Monthly infrastructure cost: PKR 80,000 → PKR 45,000 (44% savings)
  • Client SLA: 99.9% uptime achieved (vs. 98.5% before auto-scaling)

Key Takeaways

  • Load balancing distributes AI requests across multiple GPU servers
  • Use "least connections" strategy — round robin ignores GPU load differences
  • Auto-scale based on queue depth, not CPU — GPU utilization is what matters
  • Scale up fast (60s), scale down slow (5min) to avoid thrashing
  • Cold starts are the main challenge — use shared model volumes and warm pools
  • Scheduled scaling for predictable traffic saves money (pre-warm before peaks)
  • Monitor p95 latency, error rate, and GPU utilization — alert on thresholds

Next lesson: API gateway and authentication — protecting and monetizing your AI APIs.

Lesson Summary

Includes a hands-on practice lab, runnable code examples, and a 4-question knowledge check below.

Quiz: Load Balancing & Auto-Scaling AI Services

4 questions to test your understanding. Score 60% or higher to pass.