AI Infrastructure & Local LLMs, Module 8

8.2 Spot Instances, Preemptible VMs & Budget Strategies

Cloud GPUs are expensive at on-demand rates. Providers also sell their unused capacity at a steep discount, typically 60-80% and occasionally up to 90%, on the condition that they can reclaim the instance at short notice. AWS calls these spot instances, GCP calls them Spot VMs (formerly preemptible VMs), and Azure calls them Spot VMs. For AI workloads that can tolerate interruption, this is the single biggest cost-saving strategy.

How Spot/Preemptible Pricing Works

code
On-Demand: "I need a GPU now, guaranteed, no interruption."
→ Full price: $1.00/hour

Spot: "I'll take whatever GPU capacity you have left over."
→ Discounted: $0.20-0.40/hour (60-80% off)
→ Risk: Instance can be reclaimed on short notice (2 minutes on AWS, about 30 seconds on GCP)

Committed: "I'll pay monthly for 1-3 years, guaranteed."
→ Discounted: $0.60/hour (40% off)
→ Risk: You pay even if you don't use it

Spot Instance Pricing Comparison

AWS Spot Instances

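| Instance | GPU | On-Demand/hr | Spot/hr | Savings | Monthly Spot |
| --- | --- | --- | --- | --- | --- |
| g4dn.xlarge | T4 | $0.526 | $0.158 | 70% | $114 (PKR 32,000) |
| g5.xlarge | A10G | $1.006 | $0.302 | 70% | $217 (PKR 61,000) |
| p3.2xlarge | V100 | $3.06 | $0.918 | 70% | $661 (PKR 185,000) |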

GCP Preemptible/Spot VMs

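| Instance | GPU | On-Demand/hr | Spot/hr | Savings | Monthly Spot |
| --- | --- | --- | --- | --- | --- |
| n1 + T4 | T4 | $0.35 | $0.11 | 69% | $79 (PKR 22,000) |
| g2-standard-4 | L4 | $0.74 | $0.22 | 70% | $158 (PKR 44,000) |
| a2-highgpu-1g | A100 | $3.67 | $1.10 | 70% | $792 (PKR 222,000) |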
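
The monthly figures in both tables follow from the hourly rates if the instance runs continuously for a 720-hour month (24 × 30). A minimal sketch of that arithmetic; the function names and the 720-hour constant are illustrative assumptions, not part of any provider API:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: the instance runs non-stop for 30 days

def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running one instance for `hours` at a flat hourly rate."""
    return hourly_rate * hours

def savings_percent(on_demand: float, spot: float) -> float:
    """Spot discount relative to the on-demand rate, in percent."""
    return (1 - spot / on_demand) * 100

# g4dn.xlarge figures from the AWS table
on_demand, spot = 0.526, 0.158
print(f"Spot monthly: ${monthly_cost(spot):.0f}")            # $114
print(f"On-demand monthly: ${monthly_cost(on_demand):.0f}")  # $379
print(f"Savings: {savings_percent(on_demand, spot):.0f}%")   # 70%
```

If your jobs run only part of the month, pass the actual hours instead of the default; the percentage saving stays the same, but the absolute saving shrinks with usage.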

Which AI Workloads Can Use Spot?

| Workload | Spot Safe? | Why |
| --- | --- | --- |
| Model training | Yes (with checkpointing) | Can resume from last checkpoint |
| Batch inference | Yes | Reprocess failed items |
| Fine-tuning | Yes (with checkpointing) | Same as training |
| Data preprocessing | Yes | Stateless, retryable |
| Dev/testing | Yes | No production impact |
| Real-time API | Risky | Interruption = downtime |
| Low-latency API | No | Need guaranteed availability |

The Golden Rule

Use spot for anything that can be checkpointed and resumed or retried without loss. Use on-demand or dedicated for anything that must never go down.

Handling Spot Interruptions

Checkpointing for Training

Save model state every N steps so interruption only loses a few minutes of work:

python
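# PyTorch checkpoint saving and loading
import os
import torch

def save_checkpoint(model, optimizer, epoch, step, loss, path):
    """Save training state; write to a temp file first so an interruption
    mid-write cannot leave a truncated checkpoint at `path`."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = path + ".tmp"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': float(loss),  # plain number, not a tensor
    }, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; return (epoch, step) to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['step']

# Save every 500 steps. model, optimizer, dataloader, and train_step are
# assumed to be defined elsewhere in your training script.
epoch = 0  # single-epoch example
for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)
    if step > 0 and step % 500 == 0:
        save_checkpoint(model, optimizer, epoch, step, loss,
                        f"checkpoints/step_{step}.pt")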
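
Resuming on a fresh instance means finding the newest checkpoint first. The sketch below assumes the `step_<N>.pt` naming used above and picks the file with the largest step number; `latest_checkpoint` is a hypothetical helper, not a PyTorch function. It compares step numbers numerically because a plain alphabetical sort would rank `step_500.pt` after `step_1000.pt`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches only names that end in step_<digits>.pt
STEP_PATTERN = re.compile(r"step_(\d+)\.pt$")

def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the step_<N>.pt file with the largest N, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = STEP_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

At startup, a training script could call `latest_checkpoint("checkpoints")` and, if the result is not `None`, pass it to `load_checkpoint` before entering the loop.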

AWS Spot Interruption Handler

AWS publishes a notice in the instance metadata about two minutes before it reclaims a spot instance, so the training loop can poll for it:

python
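import requests

METADATA_URL = "http://169.254.169.254/latest"

def check_spot_interruption():
    """Return True if AWS has scheduled this instance for interruption."""
    try:
        # IMDSv2: fetch a short-lived session token first. Instances that
        # require IMDSv2 reject token-less requests with 401.
        token = requests.put(
            f"{METADATA_URL}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        ).text
        r = requests.get(
            f"{METADATA_URL}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=1,
        )
        # 200 = interruption scheduled; 404 = nothing scheduled
        return r.status_code == 200
    except requests.RequestException:
        return False  # metadata service unreachable (e.g. not on EC2)

# In your training loop (model, optimizer, dataloader, train_step, epoch,
# and save_checkpoint are defined as in the checkpointing example)
loss = float("nan")  # no loss yet if interrupted before the first step
for step, batch in enumerate(dataloader):
    if step % 10 == 0 and check_spot_interruption():
        print("SPOT INTERRUPTION DETECTED: saving checkpoint...")
        save_checkpoint(model, optimizer, epoch, step, loss,
                        "checkpoints/emergency.pt")
        break

    loss = train_step(model, batch)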
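
The metadata endpoint does not exist on a laptop, so the interruption path is easiest to exercise locally with a mock check. The sketch below is one way to do that; `make_mock_interruption` and `run_until_interrupted` are hypothetical names, and the loop body is a stand-in for real training.

```python
from typing import Callable, Iterable

def make_mock_interruption(trigger_after: int) -> Callable[[], bool]:
    """Return a checker that reports False `trigger_after` times, then True."""
    calls = {"count": 0}

    def check() -> bool:
        calls["count"] += 1
        return calls["count"] > trigger_after

    return check

def run_until_interrupted(batches: Iterable, check: Callable[[], bool],
                          check_every: int = 10) -> int:
    """Process batches, polling `check` every `check_every` steps.

    Returns the number of steps completed before stopping.
    """
    completed = 0
    for step, _batch in enumerate(batches):
        if step % check_every == 0 and check():
            # A real loop would save an emergency checkpoint here.
            break
        completed += 1  # stand-in for train_step(model, batch)
    return completed

# The mock fires on its 4th poll, i.e. at step 30 with polling every 10 steps.
steps_done = run_until_interrupted(range(100), make_mock_interruption(3))
print(steps_done)  # 30
```

Passing the check in as a parameter keeps the loop identical in both settings: the mock locally, the real metadata check on the spot instance.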

GCP Preemptible Shutdown Script

bash
#!/bin/bash
# preemptible-shutdown.sh
# Register this file as the VM's `shutdown-script` metadata value; GCP runs it
# when the VM is preempted, leaving roughly 30 seconds before termination.

echo "Preemptible VM shutting down, saving state..."
python3 /app/save_emergency_checkpoint.py
gcloud storage cp /checkpoints/emergency.pt gs://my-bucket/checkpoints/
echo "Checkpoint uploaded to Cloud Storage."
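
The shutdown script above launches a separate Python process, which cannot see model state held in memory by the running training job. One alternative is for the training process itself to react to the shutdown. The sketch below assumes the process receives SIGTERM during shutdown (from the OS, or sent explicitly by the shutdown script); `stop_requested`, `request_stop`, and `train` are hypothetical names, and the loop body is a stand-in for real training.

```python
import signal
import threading

stop_requested = threading.Event()

def request_stop(signum, frame):
    """Signal handler: only set a flag; the training loop does the saving."""
    stop_requested.set()

# Must be called from the main thread of the training process.
signal.signal(signal.SIGTERM, request_stop)

def train(batches):
    """Hypothetical loop that stops cleanly once SIGTERM has been received."""
    completed = 0
    for _batch in batches:
        if stop_requested.is_set():
            # A real loop would call save_checkpoint(...) and upload it here.
            break
        completed += 1  # stand-in for train_step(model, batch)
    return completed
```

With only about 30 seconds available, the save and upload together need to fit in that window, which is one more reason to keep regular checkpoints and treat the emergency save as a bonus.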

Budget Strategies

Strategy 1: Spot Fleet (AWS)

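Request several instance types and let AWS pick the cheapest pool with available capacity. The configuration below is abbreviated; a real request also needs fields such as an IAM fleet role and an AMI for each launch specification: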

json
{
  "SpotFleetRequestConfig": {
    "TargetCapacity": 4,
    "LaunchSpecifications": [
      {"InstanceType": "g4dn.xlarge"},
      {"InstanceType": "g4dn.2xlarge"},
      {"InstanceType": "g5.xlarge"}
    ],
    "AllocationStrategy": "lowestPrice"
  }
}

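With the lowestPrice strategy, AWS launches from the cheapest pool that currently has capacity. The cheapest pools can also be the most frequently interrupted, so AWS recommends the priceCapacityOptimized strategy when the interruption rate matters.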

Strategy 2: Time-of-Day Optimization

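GPU spot prices can vary with demand over the day. A rough pattern, worth checking against the spot price history for your region and instance type before you rely on it: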

code
US business hours (9 AM - 5 PM EST): High demand → Higher spot prices
US night / weekend: Low demand → Lowest spot prices
Pakistan daytime = US nighttime = CHEAPEST GPU prices

Where the price history confirms this pattern, schedule training jobs during Pakistan daytime (US nighttime) for lower spot costs.
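
The overlap between Pakistan daytime and US nighttime can be checked with a quick time-zone conversion. The sketch below uses fixed UTC offsets as a simplifying assumption: Pakistan is UTC+5 all year, and US Eastern is treated as UTC-5 (EST), ignoring daylight saving time. The date is arbitrary and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

PKT = timezone(timedelta(hours=5), "PKT")   # Pakistan Standard Time, no DST
EST = timezone(timedelta(hours=-5), "EST")  # assumption: US Eastern, DST ignored

def pakistan_to_us_eastern(hour_pkt: int) -> int:
    """Convert an hour of the day in Pakistan to the hour in US Eastern (EST)."""
    local = datetime(2024, 1, 15, hour_pkt, tzinfo=PKT)  # arbitrary date
    return local.astimezone(EST).hour

for hour in (9, 13, 17):
    print(f"{hour:02d}:00 PKT -> {pakistan_to_us_eastern(hour):02d}:00 EST")
# 09:00 PKT -> 23:00 EST
# 13:00 PKT -> 03:00 EST
# 17:00 PKT -> 07:00 EST
```

A 9 AM to 5 PM working day in Pakistan therefore falls between 11 PM and 7 AM on the US East Coast, outside the US business hours listed above.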

Strategy 3: Budget Alerts

Never get surprised by a cloud bill:

bash
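# AWS Budget Alert (CLI): email when actual spend passes 80% of $200/month
# Replace the placeholder with your own 12-digit AWS account ID
aws budgets create-budget \
    --account-id 123456789012 \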
    --budget '{
        "BudgetName": "GPU-Monthly",
        "BudgetLimit": {"Amount": "200", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }' \
    --notifications-with-subscribers '[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80
        },
        "Subscribers": [{
            "SubscriptionType": "EMAIL",
            "Address": "your@email.com"
        }]
    }]'

Strategy 4: Reserved/Committed Use Discounts

If you know you'll use a GPU for 1+ year:

| Provider | Commitment | Discount | Best For |
| --- | --- | --- | --- |
| AWS Reserved | 1 year | 40% | Predictable production |
| AWS Reserved | 3 year | 60% | Long-term projects |
| GCP Committed | 1 year | 37% | Steady workloads |
| GCP Committed | 3 year | 55% | Established products |
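
A commitment bills for every hour whether or not the GPU is busy, so it only beats on-demand above a certain utilization. With discount d, the commitment costs (1 - d) of the on-demand rate for all hours, while on-demand costs the full rate for only the fraction of hours used; the two are equal when that fraction is 1 - d. A small sketch of this break-even, using the discounts from the table (the function name is illustrative):

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours the GPU must be in use for a commitment to cost
    less than paying on-demand only for the hours used."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount

for label, discount in [("AWS 1 year", 0.40), ("AWS 3 year", 0.60),
                        ("GCP 1 year", 0.37), ("GCP 3 year", 0.55)]:
    threshold = break_even_utilization(discount)
    print(f"{label}: worthwhile above {threshold:.0%} utilization")
```

For example, a 1-year AWS reservation at 40% off pays for itself only if the GPU is in use more than 60% of the time. This comparison is against on-demand only; interruptible work is usually still cheaper on spot.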

Strategy 5: The Hybrid Stack

code
Training (batch):      Spot instances (70% off)
Fine-tuning (batch):   Spot instances (70% off)
Production API:        Hetzner dedicated (fixed PKR 40K/month)
Traffic spikes:        Cloud on-demand (pay per burst)
Development:           Local GPU (PKR 5K electricity)

Cost Monitoring Tools

| Tool | Cost | What It Does |
| --- | --- | --- |
| AWS Cost Explorer | Free | Visualize AWS spending by service |
| GCP Billing Console | Free | Same for GCP |
| Infracost | Free (OSS) | Estimate cost of Terraform changes before deploying |
| Kubecost | Free tier | K8s cost monitoring per namespace/pod |
| Custom dashboard | Free | Prometheus + Grafana with billing metrics |
Practice Lab

Task 1: Spot vs. On-Demand Calculator
Calculate the monthly cost difference between on-demand and spot for your specific workload on AWS and GCP. Include the cost of lost work from 2 spot interruptions per day (assuming 500-step checkpointing).

Task 2: Checkpointing System
Implement checkpoint saving/loading for a PyTorch training loop. Add spot interruption detection (use a mock for local testing). Verify you can stop training and resume from the last checkpoint.

Task 3: Budget Alert Setup
Set up a budget alert on AWS or GCP (free tier) that emails you when spending exceeds $50/month. Configure a second alert at 80% of your budget.

Pakistan Case Study

Meet Sana — a data scientist at a Karachi AI company training custom Urdu language models.

Her training cost problem:

  • Fine-tuning a 13B model on AWS p3.2xlarge (V100): $3.06/hour
  • Training takes ~72 hours = $220 per training run
  • She runs 3-4 experiments per week = $660-880/week = PKR 185,000-246,000/week

Her spot instance strategy:

  • Switched to spot instances: $0.92/hour (70% savings)
  • Added checkpointing every 200 steps (saves to S3)
  • Average of 1 spot interruption per training run (loses ~10 minutes of work)
  • Trained during Pakistan daytime (US nighttime = lowest spot prices)

Results:

  • Training cost per run: $220 → $66 (70% savings)
  • Weekly training cost: $880 → $264 (PKR 74,000)
  • Monthly savings: PKR 688,000
  • The 10 minutes lost per interruption is trivial — checkpoint resume takes 2 minutes
  • She now runs 2x more experiments with the same budget
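
The per-run figures in the case study can be reproduced, including the cost of the one interruption, with a few lines of arithmetic. This is a sketch using only the numbers stated above; the constant and function names are illustrative.

```python
ON_DEMAND_RATE = 3.06  # $/hr, p3.2xlarge on-demand (from the case study)
SPOT_RATE = 0.92       # $/hr, spot
RUN_HOURS = 72         # length of one training run
LOST_MINUTES = 10      # work redone after one interruption
RESUME_MINUTES = 2     # time to reload the checkpoint

def run_cost(rate: float, hours: float, interruptions: int = 0) -> float:
    """Cost of one run, adding the redo and resume time for each interruption."""
    overhead_hours = interruptions * (LOST_MINUTES + RESUME_MINUTES) / 60
    return rate * (hours + overhead_hours)

on_demand = run_cost(ON_DEMAND_RATE, RUN_HOURS)
spot = run_cost(SPOT_RATE, RUN_HOURS, interruptions=1)
print(f"On-demand: ${on_demand:.2f}")                                # $220.32
print(f"Spot with 1 interruption: ${spot:.2f}")                      # $66.42
print(f"Interruption overhead: ${spot - SPOT_RATE * RUN_HOURS:.2f}")  # $0.18
```

One interruption adds about 18 cents to a run that is roughly $154 cheaper than on-demand, which is why frequent checkpointing makes spot interruptions a minor cost.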

Key Takeaways

  • Spot/preemptible instances save 60-80% on GPU costs — the single biggest optimization
  • Only use spot for interruptible workloads: training, fine-tuning, batch inference
  • Always checkpoint during training — every 200-500 steps minimum
  • Handle interruption gracefully: save checkpoint → upload to S3 → resume on new instance
  • Pakistan timezone advantage: train during local daytime = US nighttime = cheapest spot prices
  • Set budget alerts to avoid surprise bills — cloud spending grows silently
  • The hybrid stack (spot for training, dedicated for production, local for dev) minimizes total cost

Next lesson: Building a cost-optimized AI pipeline from end to end.
