5.2 — Load Balancing & Auto-Scaling for LLM APIs
A single GPU server can handle 10-50 concurrent users before latency degrades. But what happens when your AI product goes viral — when the Karachi University chatbot you built gets mentioned on ARY News and 2,000 students hit it simultaneously? Without load balancing and auto-scaling, your service goes down and your client calls you in a panic. This lesson teaches you to build infrastructure that scales horizontally.
The Scaling Problem for LLMs
LLMs are different from regular web services. A typical web API call takes 10-50 milliseconds and is CPU-bound. An LLM generating 200 tokens takes 3-15 seconds and saturates the GPU. This means:
- A single server has a hard ceiling on concurrent throughput
- Overloading the GPU causes cascading latency spikes, not graceful degradation
- GPU memory is non-elastic — you can't add "a bit more" GPU mid-request
The solution is horizontal scaling: run multiple identical inference servers behind a load balancer, and add/remove servers based on demand.
Load Balancing Strategies for LLM APIs
Round-robin: Requests are distributed sequentially across servers. Simple but ignores current server load. If server A is processing a long 2,000-token request and server B is idle, round-robin still sends the next request to server A.
Least-connections: Routes each new request to the server with the fewest active connections. Better for LLMs where request duration varies widely. This is the recommended default.
Weighted routing: Assign weights based on GPU capability. A server with an A100 (80 GB) might get weight=3 while a server with an RTX 3090 (24 GB) gets weight=1. Useful in heterogeneous fleets.
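As a concrete sketch, least-connections with weights can be expressed in a few lines of Python (the pick_backend function and the backend dict fields are illustrative assumptions, not an Nginx or vLLM API):

```python
def pick_backend(backends):
    """Weighted least-connections: pick the server whose
    active-connection count, divided by its weight, is lowest.
    With equal weights this reduces to plain least-connections."""
    return min(backends, key=lambda b: b["active"] / b["weight"])

fleet = [
    {"name": "a100",    "weight": 3, "active": 4},  # big GPU, busier
    {"name": "rtx3090", "weight": 1, "active": 2},  # small GPU, less busy
]

# 4/3 ≈ 1.33 beats 2/1 = 2.0, so the A100 still wins
# despite holding more connections.
print(pick_backend(fleet)["name"])  # → a100
```

The division by weight is what lets a heterogeneous fleet stay balanced: the A100 absorbs proportionally more concurrent requests before it looks "busier" than the smaller card.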
Nginx as an LLM Load Balancer
Nginx is free, battle-tested, and handles the load balancing for thousands of production systems. Here's a minimal config for routing between three vLLM instances:
upstream llm_backend {
    least_conn;                  # use least-connections strategy
    server 127.0.0.1:8001;       # vLLM instance 1
    server 127.0.0.1:8002;       # vLLM instance 2
    server 127.0.0.1:8003;       # vLLM instance 3
    keepalive 32;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://llm_backend;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";
        proxy_read_timeout 300;          # LLM responses take time
        proxy_buffering off;             # critical for streaming responses
    }
}
Note proxy_buffering off — this is essential for LLM streaming (token-by-token output). Without it, Nginx buffers the entire response before sending to the client, destroying the streaming experience.
Auto-Scaling on Cloud Providers
For Pakistani startups, the most cost-effective auto-scaling approach is using spot/preemptible GPU instances:
- AWS Spot Instances: A g4dn.xlarge (T4 GPU, 16 GB VRAM) costs $0.16-0.40/hour spot vs. $0.53/hour on-demand. That's PKR 45-112/hour for a production-grade GPU server.
- Hetzner GPU Cloud: European provider with competitive PKR-equivalent pricing. Available from Pakistan with an international card.
- Lambda Labs: Dedicated GPU cloud, excellent value for A10 and A100 instances.
Basic auto-scaling logic: monitor GPU utilization and queue depth. If the average request queue exceeds 5 pending requests for more than 60 seconds, spawn a new instance; if the queue stays at 0 for 10 minutes, terminate the extra instance.
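That rule can be sketched as a pure decision function (a minimal sketch; the 10-second sampling interval and the scaling_decision helper are assumptions, not part of any real autoscaler API):

```python
from collections import deque

SCALE_UP_QUEUE = 5        # pending requests
SCALE_UP_SECONDS = 60     # sustained overload before scaling up
SCALE_DOWN_SECONDS = 600  # 10 minutes idle before scaling down

def scaling_decision(samples, interval=10):
    """samples: queue depths, most-recent-last, one per `interval` seconds.
    Returns 'up', 'down', or 'hold'."""
    need_up = SCALE_UP_SECONDS // interval
    need_down = SCALE_DOWN_SECONDS // interval
    recent = list(samples)
    # queue above threshold for the whole 60s window -> add an instance
    if len(recent) >= need_up and all(q > SCALE_UP_QUEUE for q in recent[-need_up:]):
        return "up"
    # queue empty for the whole 10-minute window -> remove an instance
    if len(recent) >= need_down and all(q == 0 for q in recent[-need_down:]):
        return "down"
    return "hold"

print(scaling_decision(deque([6, 7, 8, 9, 6, 7])))  # → up
```

Keeping the decision a pure function of recent samples makes it trivial to unit-test before wiring it to real instance-launch calls.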
Health Checks and Circuit Breakers
Production load balancers need to know when a backend is unhealthy. vLLM exposes a health endpoint at /health. Open-source Nginx supports only passive checks (it counts failed proxied requests rather than polling /health), so configure max_fails and fail_timeout on each backend; active /health polling requires Nginx Plus or an external checker:
upstream llm_backend {
    least_conn;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
}
If a server fails 3 proxied requests within 30 seconds, Nginx removes it from rotation for the rest of the fail_timeout window. This is your circuit breaker: it prevents routing traffic to a crashed GPU server.
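For intuition, the max_fails/fail_timeout behavior can be modeled as a tiny state tracker (this Breaker class is an illustrative sketch; Nginx's internal accounting differs in its details):

```python
import time

class Breaker:
    """Simplified model of Nginx's max_fails/fail_timeout: a backend
    that fails `max_fails` times within `fail_timeout` seconds is
    considered down until those failures age out of the window."""
    def __init__(self, max_fails=3, fail_timeout=30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = []  # timestamps of recent failures

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # drop failures older than the window, then record this one
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)

    def is_down(self, now=None):
        now = time.monotonic() if now is None else now
        recent = [t for t in self.failures if now - t < self.fail_timeout]
        return len(recent) >= self.max_fails

b = Breaker()
for t in (0.0, 5.0, 10.0):
    b.record_failure(now=t)
print(b.is_down(now=10.0))  # → True  (three failures within 30s)
print(b.is_down(now=50.0))  # → False (failures aged out of the window)
```

The key property to notice: recovery is automatic. Once the failures age past fail_timeout, the backend rejoins rotation without operator intervention.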
Queue Management and Rate Limiting
Without rate limiting, a single bad actor (or a misconfigured client) can flood your service and starve other users. Implement queue-based rate limiting at the application layer:
from redis import Redis

r = Redis()  # reuse one connection instead of opening one per call

def rate_limit(client_id, max_requests=10, window_seconds=60):
    key = f"rate:{client_id}"
    current = r.incr(key)              # atomic increment
    if current == 1:
        r.expire(key, window_seconds)  # start the window on first request
    if current > max_requests:
        raise RuntimeError(
            f"Rate limit exceeded: {max_requests} requests per {window_seconds}s"
        )
For Pakistani SaaS pricing tiers: free users get 10 requests/minute, paid users (PKR 2,000/month) get 60 requests/minute, enterprise (PKR 15,000/month) get unlimited. This maps directly to Redis rate-limit keys.
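One way to wire those tiers into the limiter is a small lookup table (a sketch; the TIERS dict and limit_for helper are hypothetical names, not part of any library):

```python
TIERS = {
    # per-60-second limits; tier names and prices are from the text,
    # the dict itself is an illustrative sketch
    "free":       {"max_requests": 10},    # free users
    "paid":       {"max_requests": 60},    # PKR 2,000/month
    "enterprise": {"max_requests": None},  # PKR 15,000/month, unlimited
}

def limit_for(tier):
    """Return the tier's request limit, or None for unlimited
    (meaning: skip the rate_limit() call entirely)."""
    return TIERS[tier]["max_requests"]

print(limit_for("paid"))  # → 60
```

At request time you would look up the caller's tier, and pass the result as max_requests to the rate_limit function above, bypassing it when the limit is None.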
Monitoring with Prometheus + Grafana
Production LLM APIs need real-time monitoring. vLLM exposes Prometheus metrics at /metrics. A basic Grafana dashboard should track:
- Tokens per second (throughput)
- P50/P95/P99 request latency
- GPU utilization %
- Active requests in queue
- Error rate (4xx/5xx responses)
Set alerts: if P99 latency exceeds 30 seconds or GPU utilization stays above 95% for 5 minutes, fire a PagerDuty/Slack alert.
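The alert condition can be sketched as a pure function over recent samples (nearest-rank percentile; should_alert and its parameters are illustrative assumptions, and in production a Prometheus alerting rule would compute this instead):

```python
def percentile(latencies, p):
    """Nearest-rank percentile: simpler than interpolation,
    fine for dashboards and alerting."""
    s = sorted(latencies)
    k = max(0, int(round(p / 100 * len(s))) - 1)
    return s[k]

def should_alert(latencies_s, gpu_util_history, p99_limit=30.0, util_limit=0.95):
    """Fire if P99 latency exceeds 30s, or GPU utilization stayed
    above 95% for every sample in the (5-minute) history window."""
    p99_breached = percentile(latencies_s, 99) > p99_limit
    util_pinned = bool(gpu_util_history) and all(
        u > util_limit for u in gpu_util_history
    )
    return p99_breached or util_pinned

print(should_alert([1.2, 2.5, 35.0, 3.1], [0.5, 0.6]))  # → True (P99 breached)
```

Requiring the utilization condition to hold for every sample in the window is what makes it a "sustained" alert rather than firing on a single busy scrape.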
Practice Lab
- Start two vLLM instances on ports 8001 and 8002. Configure a minimal Nginx upstream with least_conn routing. Send 20 concurrent requests using Python's asyncio and verify via the server logs that the requests are distributed across both instances.
- Simulate a server failure: kill one vLLM instance mid-test. Verify Nginx detects the failure (via max_fails health checking) and routes all traffic to the remaining instance within 30 seconds.
- Implement Redis rate limiting: set up a Redis instance (free Docker install) and wrap your API endpoint with the rate-limiting function above. Test that the 11th request within 60 seconds returns an error.
Key Takeaways
- Least-connections load balancing outperforms round-robin for LLMs because request duration varies widely
- proxy_buffering off in Nginx is mandatory for streaming LLM responses
- AWS Spot/Lambda Labs GPU instances can reduce infrastructure costs by 60-70% vs. on-demand pricing
- Health checks + circuit breakers are not optional — they prevent one crashed GPU from taking down your entire service
Quiz: Load Balancing & Auto-Scaling for LLM APIs
4 questions to test your understanding. Score 60% or higher to pass.