5.3 — Capstone: Deploy a Production LLM API on VPS
You've learned the theory. You've run local experiments. Now it's time to ship. In this capstone, you'll deploy a complete production LLM API — fine-tuned on Pakistani data, served through vLLM, protected by a load balancer, and accessible from any client via a standard OpenAI-compatible endpoint. By the end, you'll have a deployable asset that any Pakistani business can integrate with one line of Python.
The Architecture We're Building
Internet
    │
    ▼
Nginx (port 80/443) ← SSL via Let's Encrypt
    │
    ▼
FastAPI gateway (API keys + rate limiting)
    │
    ├── vLLM Instance 1 (port 8001) ← Base model + LoRA adapter
    └── vLLM Instance 2 (port 8002) ← Base model + LoRA adapter

Both instances share:
- Model weights via the /models/ directory (shared volume)
- Redis for rate limiting (used by the gateway)
- Prometheus for metrics
Client API call:
POST https://your-api.example.com/v1/chat/completions
Headers: Authorization: Bearer sk-your-key
Body: OpenAI-format JSON
Step 1 — Choose and Configure Your VPS
For Pakistani entrepreneurs, the most cost-effective options are:
- Hetzner CX52: ~€35/month (PKR 11,000/month) — CPU-only, suitable for smaller models served with CPU inference
- Vast.ai: Rent GPU by the hour. An RTX 3090 runs ~$0.35/hour (PKR 98/hour). Run 8 hours/day = PKR 23,000/month.
- RunPod.io: Similar to Vast.ai with better uptime SLAs. RTX 4090 at $0.44/hour.
- Lambda Labs: A10 GPU (24 GB VRAM) at $0.60/hour — best for 13B+ models.
For a startup serving < 1,000 requests/day, a single RTX 3090 on Vast.ai running 8 hours/day of on-demand time is the lowest-cost viable production setup.
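The monthly figures above follow from simple arithmetic. A quick sketch of the calculation, assuming an exchange rate of roughly PKR 280 per USD (check current rates):

# cost_estimate.py — rough monthly cost of hourly GPU rental (a sketch)
USD_TO_PKR = 280           # assumed exchange rate
hourly_usd = 0.35          # Vast.ai RTX 3090, on-demand
hours_per_day = 8
days_per_month = 30
monthly_pkr = hourly_usd * USD_TO_PKR * hours_per_day * days_per_month
print(f"~PKR {monthly_pkr:,.0f}/month")   # prints ~PKR 23,520/month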
Step 2 — Server Setup Script
#!/bin/bash
# server_setup.sh — Run once on fresh Ubuntu 22.04
# (assumes NVIDIA drivers/CUDA are already installed; most GPU rental images include them)

# System updates and core services
apt update && apt upgrade -y
apt install -y nginx redis-server certbot python3-certbot-nginx

# Python 3.11 (Ubuntu 22.04 ships 3.10; pull 3.11 from the deadsnakes PPA)
apt install -y software-properties-common
add-apt-repository -y ppa:deadsnakes/ppa
apt update && apt install -y python3.11 python3.11-venv

# Python environment (the venv bundles its own pip)
python3.11 -m venv /opt/llm-api-env
source /opt/llm-api-env/bin/activate

# Install vLLM and serving stack
pip install vllm transformers peft fastapi uvicorn redis python-jose httpx

# Create model directories
mkdir -p /models/base /models/adapters

echo "Setup complete. Upload model files to /models/"
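One way to populate /models/ is to pull the weights directly on the server. A minimal sketch using huggingface_hub (a vLLM dependency); the token value is a hypothetical placeholder, and Llama 3 requires accepting Meta's license on Hugging Face first:

# download_model.py — pull base weights onto the server (a sketch)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="/models/base/Meta-Llama-3-8B",
    token="hf_your_token_here",  # hypothetical placeholder; use your own token
)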
Step 3 — Systemd Service for vLLM
Systemd ensures your vLLM server restarts automatically after crashes or reboots:
# /etc/systemd/system/vllm-1.service
[Unit]
Description=vLLM Inference Server Instance 1
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/llm-api-env
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/opt/llm-api-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model /models/base/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules karachi-bot=/models/adapters/karachi-llm-v1-adapter \
    --port 8001 \
    --host 127.0.0.1
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable vllm-1
systemctl start vllm-1
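Once the service reports active, confirm the server responds. vLLM's OpenAI-compatible server exposes a model listing, which should include both the base model and the karachi-bot adapter:

systemctl status vllm-1                 # should show "active (running)"
curl http://127.0.0.1:8001/v1/models    # lists the base model and the LoRA adapter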
Step 4 — API Authentication Wrapper
vLLM has no built-in per-client API keys or rate limiting. Wrap it with a FastAPI proxy:
# api_gateway.py
from fastapi import Depends, FastAPI, Header, HTTPException
from redis import Redis
import httpx

app = FastAPI()
r = Redis()  # assumes redis-server from Step 2 on localhost:6379

# Hardcoded for the demo; in production, load keys from a database or secrets store
VALID_KEYS = {"sk-karachi-demo": {"tier": "free", "rpm": 10},
              "sk-enterprise-001": {"tier": "enterprise", "rpm": 999}}

async def authenticate(authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(401, "Invalid auth")
    key = authorization.split(" ")[1]
    if key not in VALID_KEYS:
        raise HTTPException(403, "Invalid API key")
    return {"key": key, **VALID_KEYS[key]}

@app.get("/health")
async def health():
    # Simple liveness probe, used by the verification step in Step 5
    return {"status": "ok"}

@app.post("/v1/chat/completions")
async def proxy_completion(request: dict, key_info: dict = Depends(authenticate)):
    # Fixed-window rate check: one Redis counter per API key, reset every 60 seconds
    counter = f"rpm:{key_info['key']}"
    count = r.incr(counter)
    if count == 1:
        r.expire(counter, 60)
    if count > key_info["rpm"]:
        raise HTTPException(429, "Rate limit exceeded")
    # Forward the OpenAI-format body to vLLM unchanged
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8001/v1/chat/completions",
                                 json=request, timeout=120)
    return resp.json()
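Run the gateway under uvicorn (in production, give it its own systemd unit like the one in Step 3). Port 8000 is an assumption here; whatever you choose, Nginx must proxy to it:

source /opt/llm-api-env/bin/activate
uvicorn api_gateway:app --host 127.0.0.1 --port 8000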
Step 5 — SSL with Let's Encrypt
# Point your domain's A record to your VPS IP first
certbot --nginx -d api.your-domain.com
# Certbot rewrites the Nginx config for HTTPS automatically. Verify:
curl https://api.your-domain.com/health
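Certbot needs an existing server block for the domain to edit, so create the proxy config before running it. A minimal sketch, assuming the FastAPI gateway from Step 4 is listening on 127.0.0.1:8000:

# /etc/nginx/sites-available/llm-api
server {
    listen 80;
    server_name api.your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;   # FastAPI gateway from Step 4
        proxy_read_timeout 300s;            # long generations need a generous read timeout
    }
}
# Enable it: ln -s /etc/nginx/sites-available/llm-api /etc/nginx/sites-enabled/ && nginx -s reload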
Step 6 — Testing End-to-End
import openai
client = openai.OpenAI(
api_key="sk-karachi-demo",
base_url="https://api.your-domain.com/v1"
)
response = client.chat.completions.create(
model="karachi-bot", # the LoRA adapter name
messages=[{"role": "user", "content": "DHA Phase 6 mein 5 marla plot ka rate kya hai aajkal?"}]
)
print(response.choices[0].message.content)
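You can also verify the rate limiting from Step 4 end-to-end. A quick sketch that hammers the free-tier key; under its 10 RPM cap, the final requests should come back as HTTP 429:

# rate_limit_test.py — confirm the free tier's 10 RPM cap (a sketch)
import httpx

for i in range(12):
    resp = httpx.post(
        "https://api.your-domain.com/v1/chat/completions",
        headers={"Authorization": "Bearer sk-karachi-demo"},
        json={"model": "karachi-bot",
              "messages": [{"role": "user", "content": "ping"}]},
        timeout=120,
    )
    print(i + 1, resp.status_code)  # expect 200 ten times, then 429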
Capstone Deliverables
By completing this capstone, you will have:
- A live HTTPS endpoint serving a Pakistani-fine-tuned LLM
- API key authentication with rate limiting tiers
- Auto-restart via systemd
- Nginx load balancing (even if pointing to one instance initially, the architecture is ready to scale)
- End-to-end test proving the model responds in Pakistani English/Roman Urdu appropriately
Practice Lab
- Deploy on Vast.ai: Rent an RTX 3090 for 2 hours (~PKR 200). Run the full setup script. Confirm vLLM starts and responds to a test request.
- Configure the FastAPI gateway: Add two API keys — one free tier (10 RPM), one paid tier (60 RPM). Test that exceeding the rate limit returns HTTP 429.
- Measure your final benchmarks: Use wrk or locust to send 50 concurrent requests and record your P50 and P99 latency (a plain-Python alternative is sketched after this list). Document what GPU configuration you used and the cost per hour. This is your "production spec sheet" for future client proposals.
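If you would rather not install wrk or locust, here is a plain-Python alternative for the benchmark in the last item, a minimal sketch using httpx and asyncio:

# latency_bench.py — 50 concurrent requests, report P50/P99 (a sketch)
import asyncio, statistics, time
import httpx

URL = "https://api.your-domain.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer sk-enterprise-001"}  # a key whose RPM allows 50 requests
BODY = {"model": "karachi-bot",
        "messages": [{"role": "user", "content": "Salaam! Ek line mein jawab dein."}]}

async def timed_request(client):
    start = time.perf_counter()
    await client.post(URL, headers=HEADERS, json=BODY, timeout=120)
    return time.perf_counter() - start

async def main(n=50):
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(*[timed_request(client) for _ in range(n)]))
    print(f"P50: {statistics.median(latencies):.2f}s")
    print(f"P99: {latencies[int(0.99 * (n - 1))]:.2f}s")  # nearest-rank approximation

asyncio.run(main())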
Key Takeaways
- Vast.ai / RunPod offer 60-70% cost savings vs. AWS on-demand for production LLM serving in Pakistan's budget range
- Systemd service management ensures zero-downtime restarts and auto-recovery from crashes
- Always add an API gateway layer for authentication, rate limiting, and logging — never expose vLLM directly to the internet
- A production-ready LLM API can be deployed for PKR 200-500/day on rented GPU compute — commercially viable for almost any Pakistani startup