Module 7: AI Infrastructure & Local LLMs

7.2 Load Balancing & Auto-Scaling AI Services


Your FastAPI app serves one model on one GPU. Real production means handling traffic spikes, surviving server crashes, and distributing load across multiple instances. This lesson teaches you to put a load balancer in front of your AI APIs and configure auto-scaling that reacts to demand, so your infrastructure grows and shrinks with your traffic.

Why Load Balancing Matters for AI

AI inference is computationally expensive. A single GPU can handle maybe 20-50 concurrent requests for a 7B model. Without load balancing:

code
100 concurrent users → 1 server → 50 timeout, 50 served (50% error rate)

100 concurrent users → load balancer → 3 servers → all served (0% error rate)

Load Balancing Strategies

Strategy Comparison for AI

Strategy | How It Works | Best For AI?
Round Robin | Rotate requests across servers | Bad — ignores GPU load
Least Connections | Send to server with fewest active requests | Good — respects capacity
Weighted | Send more to powerful servers | Great — A100 gets 3x traffic vs T4
IP Hash | Same client → same server | Good for session/context caching
Custom (GPU-aware) | Route based on GPU utilization | Best — purpose-built for AI
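There is no off-the-shelf Nginx module for the GPU-aware row, but the idea is easy to sketch. Below is a minimal illustration in Python, assuming each backend exposes a hypothetical /gpu-stats endpoint that returns {"utilization": 0-100}; a real router would also need streaming support, retries, and caching of the stats.

python
# gpu_router.py: minimal sketch of GPU-aware routing (not production code).
# Assumes each backend serves GET /gpu-stats -> {"utilization": 0-100};
# that endpoint is hypothetical; adapt it to the metrics you actually expose.
import asyncio

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import Response

BACKENDS = ["http://10.0.1.10:8000", "http://10.0.1.11:8000", "http://10.0.1.12:8000"]

app = FastAPI()
client = httpx.AsyncClient(timeout=120)

async def gpu_utilization(base_url: str) -> float:
    """Poll one backend's GPU utilization; unreachable servers sort last."""
    try:
        r = await client.get(f"{base_url}/gpu-stats", timeout=2)
        return float(r.json()["utilization"])
    except (httpx.HTTPError, KeyError, ValueError):
        return float("inf")

@app.post("/v1/{path:path}")
async def route(path: str, request: Request) -> Response:
    # Pick the backend with the least-loaded GPU, then forward the request
    utils = await asyncio.gather(*(gpu_utilization(b) for b in BACKENDS))
    target = BACKENDS[utils.index(min(utils))]
    upstream = await client.post(
        f"{target}/v1/{path}",
        content=await request.body(),
        headers={"content-type": request.headers.get("content-type", "application/json")},
    )
    return Response(content=upstream.content, status_code=upstream.status_code,
                    media_type=upstream.headers.get("content-type"))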

Nginx Load Balancer Configuration

nginx
# /etc/nginx/nginx.conf
upstream llm_backends {
    least_conn;    # Route to server with fewest connections

    server 10.0.1.10:8000 weight=3;  # A100 server (3x capacity)
    server 10.0.1.11:8000 weight=1;  # T4 server (baseline)
    server 10.0.1.12:8000 weight=1;  # T4 server (baseline)
}

server {
    listen 80;
    server_name api.yoursite.com;

    # Longer timeouts for AI inference
    proxy_read_timeout 120s;
    proxy_connect_timeout 10s;
    proxy_send_timeout 30s;

    location /v1/ {
        proxy_pass http://llm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    location /health {
        proxy_pass http://llm_backends;
    }
}
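To sanity-check the setup, hammer the proxy with concurrent requests and count successes. A quick sketch (the URL and request body are placeholders for your own API):

python
# load_test.py: fire N concurrent requests at the load balancer.
import asyncio
import time

import httpx

URL = "http://api.yoursite.com/v1/completions"   # your LB endpoint (placeholder)
CONCURRENCY = 50

async def one_request(client: httpx.AsyncClient) -> int:
    r = await client.post(URL, json={"prompt": "ping", "max_tokens": 1})
    return r.status_code

async def main():
    async with httpx.AsyncClient(timeout=120) as client:
        start = time.perf_counter()
        codes = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
        elapsed = time.perf_counter() - start
    ok = sum(code == 200 for code in codes)
    print(f"{ok}/{CONCURRENCY} requests succeeded in {elapsed:.1f}s")

asyncio.run(main())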

Health Checks

Open-source Nginx uses passive health checks: failed proxied requests count against a backend, rather than separate probe requests (active probing requires NGINX Plus):

nginx
upstream llm_backends {
    least_conn;
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    # If 3 requests to a server fail within 30s, it is taken
    # out of rotation for the next 30s
}
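Since max_fails counts failed proxied requests, backends should answer /health with a non-2xx status while the model is still loading. A minimal FastAPI sketch (the background-loading pattern is an assumption; wire it to however your app actually loads weights):

python
# health.py: /health returns 503 until the model is loaded, so Nginx's
# passive checks (and later, Kubernetes readiness probes) treat the
# instance as unhealthy during startup.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
app.state.model_loaded = False

async def load_model():
    # Stand-in for real loading (30-120s), e.g. await asyncio.to_thread(load_weights)
    await asyncio.sleep(60)
    app.state.model_loaded = True

@app.on_event("startup")
async def schedule_load():
    asyncio.create_task(load_model())   # load in the background; don't block startup

@app.get("/health")
async def health(response: Response):
    if not app.state.model_loaded:
        response.status_code = 503      # the load balancer counts this as a failure
        return {"status": "loading"}
    return {"status": "ok"}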

Auto-Scaling Patterns

Pattern 1: Kubernetes HPA (Horizontal Pod Autoscaler)

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Require 60s of sustained demand before scaling up
      policies:
        - type: Pods
          value: 2                       # Add max 2 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1                       # Remove max 1 pod at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
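    # The Pods metric below is custom: it only works if a metrics adapter
    # (e.g. Prometheus Adapter) is installed in the cluster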
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"

Why asymmetric scaling?

  • Scale UP fast (60s) — don't lose requests
  • Scale DOWN slow (5min) — avoid thrashing (spin up → spin down → spin up)
  • AI model loading takes 30-120s — scaling too fast creates pods that aren't ready

Pattern 2: Queue-Based Scaling

Instead of CPU-based scaling, use request queue depth:

code
Queue depth < 5  → 2 replicas (baseline)
Queue depth 5-20 → 4 replicas
Queue depth > 20 → 8 replicas
Queue depth > 50 → 10 replicas (max)

This is better than CPU-based for AI because:

  • GPU utilization is what matters, not CPU
  • Queue depth directly measures user-facing latency pressure
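For the request_queue_depth metric in the HPA above to exist, each pod has to export it. A sketch using prometheus_client inside the FastAPI app, counting in-flight requests as a proxy for queue depth (the metric name is an assumption and must match your metrics-adapter configuration):

python
# metrics.py: export queue depth so Prometheus (and the HPA, via a
# metrics adapter) can scrape it from each pod at /metrics.
from fastapi import FastAPI, Request
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()
queue_depth = Gauge("request_queue_depth", "In-flight inference requests")

@app.middleware("http")
async def track_queue(request: Request, call_next):
    queue_depth.inc()          # request entered the system
    try:
        return await call_next(request)
    finally:
        queue_depth.dec()      # request finished (or failed)

app.mount("/metrics", make_asgi_app())   # Prometheus scrapes this endpoint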

Pattern 3: Scheduled Scaling

If your traffic is predictable, scale on a schedule instead. Note that CronHPA is not built into Kubernetes; it comes from add-on controllers (for example, kubernetes-cronhpa-controller), and the exact schema varies by controller, so read this manifest as a sketch:

yaml
# Scale up for business hours (Pakistan: 9 AM - 6 PM PKT)
apiVersion: autoscaling/v2
kind: CronHPA
metadata:
  name: business-hours
spec:
  schedules:
    - schedule: "0 9 * * 1-5"   # 9 AM Mon-Fri
      minReplicas: 4
      maxReplicas: 10
    - schedule: "0 18 * * 1-5"  # 6 PM Mon-Fri
      minReplicas: 2
      maxReplicas: 4
    - schedule: "0 0 * * 6-7"   # Weekends
      minReplicas: 1
      maxReplicas: 2

Handling Cold Starts

AI models take 30-120 seconds to load. During scale-up, new pods aren't ready immediately.

Solution: Warm Pool

Keep 1-2 "warm" pods with models pre-loaded but not receiving traffic:

yaml
spec:
  replicas: 3    # 2 active + 1 warm

  # Readiness probe: only route traffic when model is loaded
  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 90   # Model loading time
    periodSeconds: 10
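If load time varies widely, a fixed initialDelaySeconds either wastes time or fires too early. A Kubernetes startupProbe handles this more gracefully; a sketch reusing the same /health endpoint:

yaml
  startupProbe:
    httpGet:
      path: /health
      port: 8000
    periodSeconds: 10
    failureThreshold: 18   # allow up to 180s for the model to load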

Solution: Model Caching

Pre-download model weights to a shared volume:

yaml
spec:
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: shared-models
  containers:
    - name: llm-api
      volumeMounts:
        - name: model-cache
          mountPath: /models   # point your model loader at this path

# All pods mount the same volume — no download needed per pod
# Model loading: 120s (download + load) → 30s (load from cache)

Monitoring & Alerting

Key Metrics to Track

Metric | What It Tells You | Alert Threshold
Request latency (p95) | User experience | > 5s for chat, > 30s for generation
Error rate | Service health | > 1% of requests
Queue depth | Capacity pressure | > 20 requests waiting
GPU utilization | Resource efficiency | < 30% (wasting money) or > 90% (overloaded)
Active replicas | Scaling behavior | Unexpected changes
VRAM usage | Memory pressure | > 90% of available

Alerting with Prometheus + AlertManager

yaml
# alert-rules.yaml
groups:
  - name: llm-api-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LLM API p95 latency above 5 seconds"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM API 5xx error rate above 1% of requests"
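A GPU rule in the same style picks up the thresholds from the metrics table. The metric name assumes NVIDIA's DCGM exporter is running; substitute whatever your exporter emits:

yaml
      - alert: GPUOverloaded
        expr: avg(DCGM_FI_DEV_GPU_UTIL) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization above 90%; consider scaling up"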

Practice Lab

Task 1: Nginx Load Balancer
Set up Nginx as a reverse proxy in front of 2 FastAPI instances (they can run on different ports on the same machine). Configure the least_conn strategy and test with concurrent requests.

Task 2: Auto-Scaling Simulation
Deploy your AI API on Kubernetes with HPA. Use a load testing tool to generate traffic ramps (10 → 50 → 100 → 10 req/sec) and observe pods scaling up and down.

Task 3: Cold Start Optimization
Measure your model's cold start time. Implement the shared volume model cache approach and measure the improvement.

Pakistan Case Study

Meet Nadia — CTO of a Karachi AI startup offering Urdu text summarization as a service.

Her problem: A news aggregator client signed up. Traffic pattern: 500 req/min during morning news hours (7-9 AM), 50 req/min rest of the day. Fixed 4-server setup meant paying for peak capacity 24/7.

Her auto-scaling solution:

  • K3s cluster on 5 Hetzner VPS nodes (GPU-equipped)
  • HPA: min 1 replica, max 6, scale on queue depth
  • Scheduled scaling: pre-warm 3 replicas at 6:45 AM PKT
  • Shared model volume: cold start 90s → 25s

Results:

  • Morning spike handled without errors (auto-scaled to 5 replicas)
  • Off-peak: scaled down to 1 replica
  • Monthly infrastructure cost: PKR 80,000 → PKR 45,000 (44% savings)
  • Client SLA: 99.9% uptime achieved (vs. 98.5% before auto-scaling)

Key Takeaways

  • Load balancing distributes AI requests across multiple GPU servers
  • Use "least connections" strategy — round robin ignores GPU load differences
  • Auto-scale based on queue depth, not CPU — GPU utilization is what matters
  • Scale up fast (60s), scale down slow (5min) to avoid thrashing
  • Cold starts are the main challenge — use shared model volumes and warm pools
  • Scheduled scaling for predictable traffic saves money (pre-warm before peaks)
  • Monitor p95 latency, error rate, and GPU utilization — alert on thresholds

Next lesson: API gateway and authentication — protecting and monetizing your AI APIs.

Lesson Summary

Includes a hands-on practice lab, runnable code examples, and a 4-question knowledge check below.

Quiz: Load Balancing & Auto-Scaling AI Services

4 questions to test your understanding. Score 60% or higher to pass.