5.3 — Capstone: Deploy a Production LLM API on VPS
You've learned the theory. You've run local experiments. Now it's time to ship. In this capstone, you'll deploy a complete production LLM API — fine-tuned on Pakistani data, served through vLLM, protected by a load balancer, and accessible from any client via a standard OpenAI-compatible endpoint. By the end, you'll have a deployable asset that any Pakistani business can integrate with one line of Python.
The Architecture We're Building
Internet
    │
    ▼
Nginx (port 80/443) ← SSL via Let's Encrypt
    │
    ▼
FastAPI gateway (API keys + rate limiting)
    │
    ├── vLLM Instance 1 (port 8001) ← Base model + LoRA adapter
    └── vLLM Instance 2 (port 8002) ← Base model + LoRA adapter

Both instances share:
- Model weights via the /models/ directory (shared volume)
- Redis for rate limiting (used by the gateway)
- Prometheus for metrics
Client API call:
POST https://your-api.example.com/v1/chat/completions
Headers: Authorization: Bearer sk-your-key
Body: OpenAI-format JSON
Step 1 — Choose and Configure Your VPS
For Pakistani entrepreneurs, the most cost-effective options are:
- Hetzner CX52: ~€35/month (PKR 11,000/month) — CPU-only, suitable for smaller models served with CPU inference
- Vast.ai: Rent GPU by the hour. An RTX 3090 runs ~$0.35/hour (PKR 98/hour). Run 8 hours/day = PKR 23,000/month.
- RunPod.io: Similar to Vast.ai with better uptime SLAs. RTX 4090 at $0.44/hour.
- Lambda Labs: A10 GPU (24 GB VRAM) at $0.60/hour — best for 13B+ models.
For a startup serving < 1,000 requests/day, a single RTX 3090 on Vast.ai running 8 hours/day of on-demand time is the lowest-cost viable production setup.
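The monthly figures above follow from simple arithmetic. A quick sketch of the calculation, assuming an exchange rate of roughly PKR 280 per USD (check current rates):

# cost_estimate.py — rough monthly cost of hourly GPU rental (a sketch)
USD_TO_PKR = 280           # assumed exchange rate
hourly_usd = 0.35          # Vast.ai RTX 3090, on-demand
hours_per_day = 8
days_per_month = 30
monthly_pkr = hourly_usd * USD_TO_PKR * hours_per_day * days_per_month
print(f"~PKR {monthly_pkr:,.0f}/month")   # prints ~PKR 23,520/month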
Step 2 — Server Setup Script
#!/bin/bash
# server_setup.sh — Run once on fresh Ubuntu 22.04
# (assumes NVIDIA drivers/CUDA are already installed; most GPU rental images include them)

# System updates and core services
apt update && apt upgrade -y
apt install -y nginx redis-server certbot python3-certbot-nginx

# Python 3.11 (Ubuntu 22.04 ships 3.10; pull 3.11 from the deadsnakes PPA)
apt install -y software-properties-common
add-apt-repository -y ppa:deadsnakes/ppa
apt update && apt install -y python3.11 python3.11-venv

# Python environment (the venv bundles its own pip)
python3.11 -m venv /opt/llm-api-env
source /opt/llm-api-env/bin/activate

# Install vLLM and serving stack
pip install vllm transformers peft fastapi uvicorn redis python-jose httpx

# Create model directories
mkdir -p /models/base /models/adapters

echo "Setup complete. Upload model files to /models/"
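One way to populate /models/ is to pull the weights directly on the server. A minimal sketch using huggingface_hub (a vLLM dependency); the token value is a hypothetical placeholder, and Llama 3 requires accepting Meta's license on Hugging Face first:

# download_model.py — pull base weights onto the server (a sketch)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="/models/base/Meta-Llama-3-8B",
    token="hf_your_token_here",  # hypothetical placeholder; use your own token
)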
Step 3 — Systemd Service for vLLM
Systemd ensures your vLLM server restarts automatically after crashes or reboots:
# /etc/systemd/system/vllm-1.service
[Unit]
Description=vLLM Inference Server Instance 1
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/llm-api-env
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/opt/llm-api-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model /models/base/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules karachi-bot=/models/adapters/karachi-llm-v1-adapter \
    --port 8001 \
    --host 127.0.0.1
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable vllm-1
systemctl start vllm-1
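Once the service reports active, confirm the server responds. vLLM's OpenAI-compatible server exposes a model listing, which should include both the base model and the karachi-bot adapter:

systemctl status vllm-1                 # should show "active (running)"
curl http://127.0.0.1:8001/v1/models    # lists the base model and the LoRA adapter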
Step 4 — API Authentication Wrapper
vLLM has no built-in per-client API keys or rate limiting. Wrap it with a FastAPI proxy:
# api_gateway.py
from fastapi import Depends, FastAPI, Header, HTTPException
from redis import Redis
import httpx

app = FastAPI()
r = Redis()  # assumes redis-server from Step 2 on localhost:6379

# Hardcoded for the demo; in production, load keys from a database or secrets store
VALID_KEYS = {"sk-karachi-demo": {"tier": "free", "rpm": 10},
              "sk-enterprise-001": {"tier": "enterprise", "rpm": 999}}

async def authenticate(authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(401, "Invalid auth")
    key = authorization.split(" ")[1]
    if key not in VALID_KEYS:
        raise HTTPException(403, "Invalid API key")
    return {"key": key, **VALID_KEYS[key]}

@app.get("/health")
async def health():
    # Simple liveness probe, used by the verification step in Step 5
    return {"status": "ok"}

@app.post("/v1/chat/completions")
async def proxy_completion(request: dict, key_info: dict = Depends(authenticate)):
    # Fixed-window rate check: one Redis counter per API key, reset every 60 seconds
    counter = f"rpm:{key_info['key']}"
    count = r.incr(counter)
    if count == 1:
        r.expire(counter, 60)
    if count > key_info["rpm"]:
        raise HTTPException(429, "Rate limit exceeded")
    # Forward the OpenAI-format body to vLLM unchanged
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8001/v1/chat/completions",
                                 json=request, timeout=120)
    return resp.json()
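Run the gateway under uvicorn (in production, give it its own systemd unit like the one in Step 3). Port 8000 is an assumption here; whatever you choose, Nginx must proxy to it:

source /opt/llm-api-env/bin/activate
uvicorn api_gateway:app --host 127.0.0.1 --port 8000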
Step 5 — SSL with Let's Encrypt
# Point your domain's A record to your VPS IP first
certbot --nginx -d api.your-domain.com
# Certbot rewrites the Nginx config for HTTPS automatically. Verify:
curl https://api.your-domain.com/health
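Certbot needs an existing server block for the domain to edit, so create the proxy config before running it. A minimal sketch, assuming the FastAPI gateway from Step 4 is listening on 127.0.0.1:8000:

# /etc/nginx/sites-available/llm-api
server {
    listen 80;
    server_name api.your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;   # FastAPI gateway from Step 4
        proxy_read_timeout 300s;            # long generations need a generous read timeout
    }
}
# Enable it: ln -s /etc/nginx/sites-available/llm-api /etc/nginx/sites-enabled/ && nginx -s reload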
Step 6 — Testing End-to-End
import openai
client = openai.OpenAI(
api_key="sk-karachi-demo",
base_url="https://api.your-domain.com/v1"
)
response = client.chat.completions.create(
model="karachi-bot", # the LoRA adapter name
messages=[{"role": "user", "content": "DHA Phase 6 mein 5 marla plot ka rate kya hai aajkal?"}]
)
print(response.choices[0].message.content)
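You can also verify the rate limiting from Step 4 end-to-end. A quick sketch that hammers the free-tier key; under its 10 RPM cap, the final requests should come back as HTTP 429:

# rate_limit_test.py — confirm the free tier's 10 RPM cap (a sketch)
import httpx

for i in range(12):
    resp = httpx.post(
        "https://api.your-domain.com/v1/chat/completions",
        headers={"Authorization": "Bearer sk-karachi-demo"},
        json={"model": "karachi-bot",
              "messages": [{"role": "user", "content": "ping"}]},
        timeout=120,
    )
    print(i + 1, resp.status_code)  # expect 200 ten times, then 429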
Capstone Deliverables
By completing this capstone, you will have:
- A live HTTPS endpoint serving a Pakistani-fine-tuned LLM
- API key authentication with rate limiting tiers
- Auto-restart via systemd
- Nginx load balancing (even if pointing to one instance initially, the architecture is ready to scale)
- End-to-end test proving the model responds in Pakistani English/Roman Urdu appropriately
Practice Lab
- Deploy on Vast.ai: Rent an RTX 3090 for 2 hours (~PKR 200). Run the full setup script. Confirm vLLM starts and responds to a test request.
- Configure the FastAPI gateway: Add two API keys — one free tier (10 RPM), one paid tier (60 RPM). Test that exceeding the rate limit returns HTTP 429.
- Measure your final benchmarks: Use wrk or locust to send 50 concurrent requests and record your P50 and P99 latency (a plain-Python alternative is sketched after this list). Document what GPU configuration you used and the cost per hour. This is your "production spec sheet" for future client proposals.
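If you would rather not install wrk or locust, here is a plain-Python alternative for the benchmark in the last item, a minimal sketch using httpx and asyncio:

# latency_bench.py — 50 concurrent requests, report P50/P99 (a sketch)
import asyncio, statistics, time
import httpx

URL = "https://api.your-domain.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer sk-enterprise-001"}  # a key whose RPM allows 50 requests
BODY = {"model": "karachi-bot",
        "messages": [{"role": "user", "content": "Salaam! Ek line mein jawab dein."}]}

async def timed_request(client):
    start = time.perf_counter()
    await client.post(URL, headers=HEADERS, json=BODY, timeout=120)
    return time.perf_counter() - start

async def main(n=50):
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(*[timed_request(client) for _ in range(n)]))
    print(f"P50: {statistics.median(latencies):.2f}s")
    print(f"P99: {latencies[int(0.99 * (n - 1))]:.2f}s")  # nearest-rank approximation

asyncio.run(main())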
Key Takeaways
- Vast.ai / RunPod offer 60-70% cost savings vs. AWS on-demand for production LLM serving in Pakistan's budget range
- Systemd service management ensures zero-downtime restarts and auto-recovery from crashes
- Always add an API gateway layer for authentication, rate limiting, and logging — never expose vLLM directly to the internet
- A production-ready LLM API can be deployed for PKR 200-500/day on rented GPU compute — commercially viable for almost any Pakistani startup