7.2 — Load Balancing & Auto-Scaling AI Services
Your FastAPI service runs one model on one GPU. Real production means handling traffic spikes, surviving server crashes, and distributing load across multiple instances. This lesson teaches you to put a load balancer in front of your AI APIs and configure auto-scaling that reacts to demand, so your infrastructure grows and shrinks with your traffic.
Why Load Balancing Matters for AI
AI inference is computationally expensive. A single GPU can handle maybe 20-50 concurrent requests for a 7B model. Without load balancing:
100 concurrent users → 1 server → 50 timeout, 50 served (50% error rate)
100 concurrent users → load balancer → 3 servers → all served (0% error rate)
Load Balancing Strategies
Strategy Comparison for AI
| Strategy | How It Works | Best For AI? |
|---|---|---|
| Round Robin | Rotate requests across servers | Bad — ignores GPU load |
| Least Connections | Send to server with fewest active requests | Good — respects capacity |
| Weighted | Send more to powerful servers | Great — A100 gets 3x traffic vs T4 |
| IP Hash | Same client → same server | Good for session/context caching |
| Custom (GPU-aware) | Route based on GPU utilization | Best — purpose-built for AI |
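The GPU-aware strategy usually means a thin routing layer that asks each backend for its current GPU utilization and forwards the request to the least-loaded one. Below is a minimal sketch of that idea; the `/gpu-stats` endpoint and its `gpu_util` field are assumptions for illustration, not a standard API your backends already expose.

```python
# gpu_aware_router.py -- pick the backend with the lowest reported GPU utilization
# Assumes each backend exposes a hypothetical GET /gpu-stats -> {"gpu_util": 0-100}
import httpx

BACKENDS = ["http://10.0.1.10:8000", "http://10.0.1.11:8000", "http://10.0.1.12:8000"]

def pick_backend() -> str:
    loads: dict[str, float] = {}
    for base in BACKENDS:
        try:
            resp = httpx.get(f"{base}/gpu-stats", timeout=2.0)
            loads[base] = resp.json()["gpu_util"]
        except httpx.HTTPError:
            continue  # skip backends that do not answer in time
    if not loads:
        raise RuntimeError("no healthy backends")
    return min(loads, key=loads.get)

# Usage: forward the next user request to the chosen backend
print(f"Routing next request to {pick_backend()}")
```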
Nginx Load Balancer Configuration
```nginx
# /etc/nginx/nginx.conf
# Note: upstream and server blocks belong inside the http {} context

upstream llm_backends {
    least_conn;                       # Route to the server with the fewest active connections
    server 10.0.1.10:8000 weight=3;   # A100 server (3x capacity)
    server 10.0.1.11:8000 weight=1;   # T4 server (baseline)
    server 10.0.1.12:8000 weight=1;   # T4 server (baseline)
}

server {
    listen 80;
    server_name api.yoursite.com;

    # Longer timeouts for AI inference
    proxy_read_timeout 120s;
    proxy_connect_timeout 10s;
    proxy_send_timeout 30s;

    location /v1/ {
        proxy_pass http://llm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location /health {
        proxy_pass http://llm_backends;
    }
}
```
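A quick way to sanity-check the setup is to fire concurrent requests at the proxy and confirm they all succeed within your timeout budget. Here is a minimal smoke-test sketch, assuming `httpx` is installed and the proxy is reachable at the hostname from the config above; adjust `BASE_URL` and the concurrency level to your environment.

```python
# load_test_sketch.py -- concurrency smoke test against the load balancer (assumes httpx)
import asyncio
import time

import httpx

BASE_URL = "http://api.yoursite.com"  # or http://localhost when testing on one machine
CONCURRENCY = 50

async def one_request(client: httpx.AsyncClient) -> tuple[int, float]:
    start = time.perf_counter()
    resp = await client.get(f"{BASE_URL}/health")
    return resp.status_code, time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient(timeout=30.0) as client:
        results = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
    ok = sum(1 for status, _ in results if status == 200)
    slowest = max(latency for _, latency in results)
    print(f"{ok}/{CONCURRENCY} succeeded, slowest request: {slowest:.2f}s")

asyncio.run(main())
```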
Health Checks
Open-source Nginx uses passive health checks: it takes a backend out of rotation after repeated failed requests.

```nginx
upstream llm_backends {
    least_conn;
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    # If 3 proxied requests to a server fail within 30s, remove it from
    # rotation for 30s (active health checks require NGINX Plus or a module)
}
```
Auto-Scaling Patterns
Pattern 1: Kubernetes HPA (Horizontal Pod Autoscaler)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
        - type: Pods
          value: 2                      # Add max 2 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1                      # Remove max 1 pod at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
```
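Note that `request_queue_depth` is a custom Pods metric: Kubernetes does not collect it on its own, so your API has to export it (see the sketch under Pattern 2 below) and a custom-metrics adapter such as prometheus-adapter has to make it visible to the HPA. Once the HPA is applied, `kubectl get hpa llm-api-hpa -w` lets you watch replica counts react to load.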
Why asymmetric scaling?
- Scale UP fast (60s) — don't lose requests
- Scale DOWN slow (5min) — avoid thrashing (spin up → spin down → spin up)
- AI model loading takes 30-120s — scaling too fast creates pods that aren't ready
Pattern 2: Queue-Based Scaling
Instead of CPU-based scaling, use request queue depth:
Queue depth < 5 → 2 replicas (baseline)
Queue depth 5-20 → 4 replicas
Queue depth > 20 → 8 replicas
Queue depth > 50 → 10 replicas (max)
This is better than CPU-based for AI because:
- GPU utilization is what matters, not CPU
- Queue depth directly measures user-facing latency pressure
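For an autoscaler to see queue depth, the API has to publish it as a metric. Here is a minimal sketch using FastAPI and `prometheus_client`; the in-process `request_queue` is a hypothetical inference work queue, and the gauge name matches the `request_queue_depth` metric used by the HPA above. Wiring the scraped metric through Prometheus and a custom-metrics adapter is left to your cluster setup.

```python
# metrics.py -- expose request queue depth for autoscaling (assumes fastapi, prometheus_client)
import asyncio

from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()   # hypothetical inference work queue

QUEUE_DEPTH = Gauge("request_queue_depth", "Requests waiting for a GPU worker")

@app.middleware("http")
async def track_queue_depth(request, call_next):
    # Refresh the gauge on every request so Prometheus always scrapes a current value
    QUEUE_DEPTH.set(request_queue.qsize())
    return await call_next(request)

# Prometheus scrapes this endpoint
app.mount("/metrics", make_asgi_app())
```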
Pattern 3: Scheduled Scaling
If your traffic is predictable:
```yaml
# Scale up for business hours (Pakistan: 9 AM - 6 PM PKT)
# CronHPA is not part of core Kubernetes; it requires a third-party controller
# (e.g. kubernetes-cronhpa-controller), and exact apiVersion/fields depend on it
apiVersion: autoscaling/v2
kind: CronHPA
metadata:
  name: business-hours
spec:
  schedules:
    - schedule: "0 9 * * 1-5"    # 9 AM Mon-Fri
      minReplicas: 4
      maxReplicas: 10
    - schedule: "0 18 * * 1-5"   # 6 PM Mon-Fri
      minReplicas: 2
      maxReplicas: 4
    - schedule: "0 0 * * 6-7"    # Weekends
      minReplicas: 1
      maxReplicas: 2
```
Handling Cold Starts
AI models take 30-120 seconds to load. During scale-up, new pods aren't ready immediately.
Solution: Warm Pool
Keep 1-2 "warm" pods with models pre-loaded but not receiving traffic:
```yaml
spec:
  replicas: 3                    # 2 active + 1 warm
  # Readiness probe: only route traffic once the model is loaded
  # (in a full Deployment, readinessProbe sits on the container in spec.template.spec)
  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 90      # Model loading time
    periodSeconds: 10
```
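On the application side, `/health` should only report success once the model is actually in memory; otherwise the readiness probe passes before the pod can serve. A minimal FastAPI sketch follows, where `load_model()` is a placeholder for your own slow weight-loading step.

```python
# app.py -- health endpoint that reflects model readiness (load_model() is a placeholder)
from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

state = {"model": None}

def load_model():
    # Placeholder for the slow part: downloading weights and moving them onto the GPU
    return object()

@asynccontextmanager
async def lifespan(app: FastAPI):
    state["model"] = load_model()   # runs once at startup, before traffic arrives
    yield
    state["model"] = None

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health(response: Response):
    if state["model"] is None:
        response.status_code = 503  # readiness probe fails until the model is loaded
        return {"status": "loading"}
    return {"status": "ready"}
```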
Solution: Model Caching
Pre-download model weights to a shared volume:
```yaml
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: shared-models
# All pods mount the same volume, so no download is needed per pod
# (the PVC needs ReadWriteMany-capable storage for multi-pod mounts)
# Model loading: 120s (download + load) → 30s (load from cache)
```
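One way to populate that cache is a one-off job that downloads the weights to the shared volume before any scaling event happens. Here is a sketch using `huggingface_hub`; the model id and the `/models` mount path are illustrative assumptions, so substitute your own.

```python
# warm_cache.py -- pre-download model weights into the shared volume (assumes huggingface_hub)
from huggingface_hub import snapshot_download

# Run once (e.g. as a Kubernetes Job) with the shared PVC mounted at /models.
# Inference pods then point their model loader at the same path and skip the download.
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",  # example model id, swap for your own
    local_dir="/models/mistral-7b-instruct",
)
```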
Monitoring & Alerting
Key Metrics to Track
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Request latency (p95) | User experience | > 5s for chat, > 30s for generation |
| Error rate | Service health | > 1% of requests |
| Queue depth | Capacity pressure | > 20 requests waiting |
| GPU utilization | Resource efficiency | < 30% (wasting money) or > 90% (overloaded) |
| Active replicas | Scaling behavior | Unexpected changes |
| VRAM usage | Memory pressure | > 90% of available |
Alerting with Prometheus + AlertManager
```yaml
# alert-rules.yaml
groups:
  - name: llm-api-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LLM API p95 latency above 5 seconds"
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 1m
        labels:
          severity: critical
```
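These rules assume the API actually exports `request_duration_seconds` and `http_requests_total`. A minimal FastAPI middleware sketch with `prometheus_client` that emits both is shown below; the metric and label names are chosen to match the alert expressions above, so adapt them to whatever your service already exports.

```python
# instrumentation.py -- export latency and status-code metrics (assumes fastapi, prometheus_client)
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()

REQUEST_DURATION = Histogram("request_duration_seconds", "Request latency in seconds")
REQUESTS_TOTAL = Counter("http_requests", "Requests by status code", ["status"])  # exposed as http_requests_total

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_DURATION.observe(time.perf_counter() - start)
    REQUESTS_TOTAL.labels(status=str(response.status_code)).inc()
    return response

# Prometheus scrapes this endpoint
app.mount("/metrics", make_asgi_app())
```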
Practice Lab
Task 1: Nginx Load Balancer
Set up Nginx as a reverse proxy in front of 2 FastAPI instances (can run on different ports on the same machine). Configure least_conn strategy and test with concurrent requests.
Task 2: Auto-Scaling Simulation
Deploy your AI API on Kubernetes with HPA. Use a load testing tool to generate traffic ramps (10 → 50 → 100 → 10 req/sec) and observe pods scaling up and down.
Task 3: Cold Start Optimization
Measure your model's cold start time. Implement the shared volume model cache approach and measure the improvement.
Pakistan Case Study
Meet Nadia — CTO of a Karachi AI startup offering Urdu text summarization as a service.
Her problem: A news aggregator client signed up. Traffic pattern: 500 req/min during morning news hours (7-9 AM), 50 req/min rest of the day. Fixed 4-server setup meant paying for peak capacity 24/7.
Her auto-scaling solution:
- K3s cluster on 5 Hetzner VPS nodes (GPU-equipped)
- HPA: min 1 replica, max 6, scale on queue depth
- Scheduled scaling: pre-warm 3 replicas at 6:45 AM PKT
- Shared model volume: cold start 90s → 25s
Results:
- Morning spike handled without errors (auto-scaled to 5 replicas)
- Off-peak: scaled down to 1 replica
- Monthly infrastructure cost: PKR 80,000 → PKR 45,000 (44% savings)
- Client SLA: 99.9% uptime achieved (vs. 98.5% before auto-scaling)
Key Takeaways
- Load balancing distributes AI requests across multiple GPU servers
- Use "least connections" strategy — round robin ignores GPU load differences
- Auto-scale based on queue depth, not CPU — GPU utilization is what matters
- Scale up fast (60s), scale down slow (5min) to avoid thrashing
- Cold starts are the main challenge — use shared model volumes and warm pools
- Scheduled scaling for predictable traffic saves money (pre-warm before peaks)
- Monitor p95 latency, error rate, and GPU utilization — alert on thresholds
Next lesson: API gateway and authentication — protecting and monetizing your AI APIs.
Quiz: Load Balancing & Auto-Scaling AI Services
4 questions to test your understanding. Score 60% or higher to pass.