5.2 — Load Balancing & Auto-Scaling for LLM APIs
A single GPU server can handle 10-50 concurrent users before latency degrades. But what happens when your AI product goes viral — when the Karachi University chatbot you built gets mentioned on ARY News and 2,000 students hit it simultaneously? Without load balancing and auto-scaling, your service goes down and your client calls you in a panic. This lesson teaches you to build infrastructure that scales horizontally.
The Scaling Problem for LLMs
LLMs are different from regular web services. A typical web API call takes 10-50 milliseconds and is CPU-bound. An LLM generating 200 tokens takes 3-15 seconds and saturates the GPU. This means:
- A single server has a hard ceiling on concurrent throughput
- Overloading the GPU causes cascading latency spikes, not graceful degradation
- GPU memory is non-elastic — you can't add "a bit more" GPU mid-request
The solution is horizontal scaling: run multiple identical inference servers behind a load balancer, and add/remove servers based on demand.
Load Balancing Strategies for LLM APIs
Round-robin: Requests are distributed sequentially across servers. Simple but ignores current server load. If server A is processing a long 2,000-token request and server B is idle, round-robin still sends the next request to server A.
Least-connections: Routes each new request to the server with the fewest active connections. Better for LLMs where request duration varies widely. This is the recommended default.
Weighted routing: Assign weights based on GPU capability. A server with an A100 (80 GB) might get weight=3 while a server with an RTX 3090 (24 GB) gets weight=1. Useful in heterogeneous fleets.
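As a concrete sketch, least-connections with weights can be expressed in a few lines of Python (the pick_backend function and the backend dict fields are illustrative assumptions, not an Nginx or vLLM API):

```python
def pick_backend(backends):
    """Weighted least-connections: pick the server whose
    active-connection count, divided by its weight, is lowest.
    With equal weights this reduces to plain least-connections."""
    return min(backends, key=lambda b: b["active"] / b["weight"])

fleet = [
    {"name": "a100",    "weight": 3, "active": 4},  # big GPU, busier
    {"name": "rtx3090", "weight": 1, "active": 2},  # small GPU, less busy
]

# 4/3 ≈ 1.33 beats 2/1 = 2.0, so the A100 still wins
# despite holding more connections.
print(pick_backend(fleet)["name"])  # → a100
```

The division by weight is what lets a heterogeneous fleet stay balanced: the A100 absorbs proportionally more concurrent requests before it looks "busier" than the smaller card.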
Nginx as an LLM Load Balancer
Nginx is free, battle-tested, and handles the load balancing for thousands of production systems. Here's a minimal config for routing between three vLLM instances:
upstream llm_backend {
    least_conn;                  # use least-connections strategy
    server 127.0.0.1:8001;       # vLLM instance 1
    server 127.0.0.1:8002;       # vLLM instance 2
    server 127.0.0.1:8003;       # vLLM instance 3
    keepalive 32;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://llm_backend;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";
        proxy_read_timeout 300;          # LLM responses take time
        proxy_buffering off;             # critical for streaming responses
    }
}
Note proxy_buffering off — this is essential for LLM streaming (token-by-token output). Without it, Nginx buffers the entire response before sending to the client, destroying the streaming experience.
Auto-Scaling on Cloud Providers
For Pakistani startups, the most cost-effective auto-scaling approach is using spot/preemptible GPU instances:
- AWS Spot Instances: A g4dn.xlarge (T4 GPU, 16 GB VRAM) costs $0.16-0.40/hour spot vs. $0.53/hour on-demand. That's PKR 45-112/hour for a production-grade GPU server.
- Hetzner GPU Cloud: European provider with competitive PKR-equivalent pricing. Available from Pakistan with an international card.
- Lambda Labs: Dedicated GPU cloud, excellent value for A10 and A100 instances.
Basic auto-scaling logic: monitor GPU utilization and queue depth. If the average request queue exceeds 5 pending requests for more than 60 seconds, spawn a new instance; if the queue stays at 0 for 10 minutes, terminate the extra instance.
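That rule can be sketched as a pure decision function (a minimal sketch; the 10-second sampling interval and the scaling_decision helper are assumptions, not part of any real autoscaler API):

```python
from collections import deque

SCALE_UP_QUEUE = 5        # pending requests
SCALE_UP_SECONDS = 60     # sustained overload before scaling up
SCALE_DOWN_SECONDS = 600  # 10 minutes idle before scaling down

def scaling_decision(samples, interval=10):
    """samples: queue depths, most-recent-last, one per `interval` seconds.
    Returns 'up', 'down', or 'hold'."""
    need_up = SCALE_UP_SECONDS // interval
    need_down = SCALE_DOWN_SECONDS // interval
    recent = list(samples)
    # queue above threshold for the whole 60s window -> add an instance
    if len(recent) >= need_up and all(q > SCALE_UP_QUEUE for q in recent[-need_up:]):
        return "up"
    # queue empty for the whole 10-minute window -> remove an instance
    if len(recent) >= need_down and all(q == 0 for q in recent[-need_down:]):
        return "down"
    return "hold"

print(scaling_decision(deque([6, 7, 8, 9, 6, 7])))  # → up
```

Keeping the decision a pure function of recent samples makes it trivial to unit-test before wiring it to real instance-launch calls.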
Health Checks and Circuit Breakers
Production load balancers need to know when a backend is unhealthy. vLLM exposes a health endpoint at /health. Open-source Nginx supports only passive checks (it counts failed proxied requests rather than polling /health), so configure max_fails and fail_timeout on each backend; active /health polling requires Nginx Plus or an external checker:
upstream llm_backend {
    least_conn;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
}
If a server fails 3 proxied requests within 30 seconds, Nginx removes it from rotation for the rest of the fail_timeout window. This is your circuit breaker: it prevents routing traffic to a crashed GPU server.
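For intuition, the max_fails/fail_timeout behavior can be modeled as a tiny state tracker (this Breaker class is an illustrative sketch; Nginx's internal accounting differs in its details):

```python
import time

class Breaker:
    """Simplified model of Nginx's max_fails/fail_timeout: a backend
    that fails `max_fails` times within `fail_timeout` seconds is
    considered down until those failures age out of the window."""
    def __init__(self, max_fails=3, fail_timeout=30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = []  # timestamps of recent failures

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # drop failures older than the window, then record this one
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)

    def is_down(self, now=None):
        now = time.monotonic() if now is None else now
        recent = [t for t in self.failures if now - t < self.fail_timeout]
        return len(recent) >= self.max_fails

b = Breaker()
for t in (0.0, 5.0, 10.0):
    b.record_failure(now=t)
print(b.is_down(now=10.0))  # → True  (three failures within 30s)
print(b.is_down(now=50.0))  # → False (failures aged out of the window)
```

The key property to notice: recovery is automatic. Once the failures age past fail_timeout, the backend rejoins rotation without operator intervention.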
Queue Management and Rate Limiting
Without rate limiting, a single bad actor (or a misconfigured client) can flood your service and starve other users. Implement queue-based rate limiting at the application layer:
from redis import Redis

r = Redis()  # reuse one connection instead of opening one per call

def rate_limit(client_id, max_requests=10, window_seconds=60):
    key = f"rate:{client_id}"
    current = r.incr(key)              # atomic increment
    if current == 1:
        r.expire(key, window_seconds)  # start the window on first request
    if current > max_requests:
        raise RuntimeError(
            f"Rate limit exceeded: {max_requests} requests per {window_seconds}s"
        )
For Pakistani SaaS pricing tiers: free users get 10 requests/minute, paid users (PKR 2,000/month) get 60 requests/minute, enterprise (PKR 15,000/month) get unlimited. This maps directly to Redis rate-limit keys.
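One way to wire those tiers into the limiter is a small lookup table (a sketch; the TIERS dict and limit_for helper are hypothetical names, not part of any library):

```python
TIERS = {
    # per-60-second limits; tier names and prices are from the text,
    # the dict itself is an illustrative sketch
    "free":       {"max_requests": 10},    # free users
    "paid":       {"max_requests": 60},    # PKR 2,000/month
    "enterprise": {"max_requests": None},  # PKR 15,000/month, unlimited
}

def limit_for(tier):
    """Return the tier's request limit, or None for unlimited
    (meaning: skip the rate_limit() call entirely)."""
    return TIERS[tier]["max_requests"]

print(limit_for("paid"))  # → 60
```

At request time you would look up the caller's tier, and pass the result as max_requests to the rate_limit function above, bypassing it when the limit is None.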
Monitoring with Prometheus + Grafana
Production LLM APIs need real-time monitoring. vLLM exposes Prometheus metrics at /metrics. A basic Grafana dashboard should track:
- Tokens per second (throughput)
- P50/P95/P99 request latency
- GPU utilization %
- Active requests in queue
- Error rate (4xx/5xx responses)
Set alerts: if P99 latency exceeds 30 seconds or GPU utilization stays above 95% for 5 minutes, fire a PagerDuty/Slack alert.
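The alert condition can be sketched as a pure function over recent samples (nearest-rank percentile; should_alert and its parameters are illustrative assumptions, and in production a Prometheus alerting rule would compute this instead):

```python
def percentile(latencies, p):
    """Nearest-rank percentile: simpler than interpolation,
    fine for dashboards and alerting."""
    s = sorted(latencies)
    k = max(0, int(round(p / 100 * len(s))) - 1)
    return s[k]

def should_alert(latencies_s, gpu_util_history, p99_limit=30.0, util_limit=0.95):
    """Fire if P99 latency exceeds 30s, or GPU utilization stayed
    above 95% for every sample in the (5-minute) history window."""
    p99_breached = percentile(latencies_s, 99) > p99_limit
    util_pinned = bool(gpu_util_history) and all(
        u > util_limit for u in gpu_util_history
    )
    return p99_breached or util_pinned

print(should_alert([1.2, 2.5, 35.0, 3.1], [0.5, 0.6]))  # → True (P99 breached)
```

Requiring the utilization condition to hold for every sample in the window is what makes it a "sustained" alert rather than firing on a single busy scrape.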
Practice Lab
- Start two vLLM instances on ports 8001 and 8002. Configure a minimal Nginx upstream with least_conn routing. Send 20 concurrent requests using Python's asyncio and verify via the server logs that the requests are distributed across both instances.
- Simulate a server failure: kill one vLLM instance mid-test. Verify Nginx detects the failure (via max_fails health checking) and routes all traffic to the remaining instance within 30 seconds.
- Implement Redis rate limiting: set up a Redis instance (free Docker install) and wrap your API endpoint with the rate-limiting function above. Test that the 11th request within 60 seconds returns an error.
Key Takeaways
- Least-connections load balancing outperforms round-robin for LLMs because request duration varies widely
- proxy_buffering off in Nginx is mandatory for streaming LLM responses
- AWS Spot/Lambda Labs GPU instances can reduce infrastructure costs by 60-70% vs. on-demand pricing
- Health checks + circuit breakers are not optional — they prevent one crashed GPU from taking down your entire service
Quiz: Load Balancing & Auto-Scaling for LLM APIs
4 questions to test your understanding. Score 60% or higher to pass.