Spot Instances, Preemptible VMs & Budget Strategies

Cloud GPUs are expensive at on-demand rates. But providers also sell their unused capacity at a 60-90% discount if you accept that your instance can be interrupted. These are called spot instances (AWS), preemptible or Spot VMs (GCP), and Spot VMs (Azure). For AI workloads that can tolerate interruption, this is the single biggest cost-saving strategy.

How Spot/Preemptible Pricing Works

code

On-Demand: "I need a GPU now, guaranteed, no interruption."
→ Full price: $1.00/hour

Spot: "I'll take whatever GPU capacity you have left over."
→ Discounted: $0.20-0.40/hour (60-80% off)
→ Risk: Instance can be reclaimed on short notice (2 minutes on AWS, about 30 seconds on GCP)

Committed: "I'll pay monthly for 1-3 years, guaranteed."
→ Discounted: $0.60/hour (40% off)
→ Risk: You pay even if you don't use it

Spot Instance Pricing Comparison

AWS Spot Instances

Instance	GPU	On-Demand/hr	Spot/hr	Savings	Monthly Spot (720 hrs)
g4dn.xlarge	T4	$0.526	$0.158	70%	$114 (PKR 32,000)
g5.xlarge	A10G	$1.006	$0.302	70%	$217 (PKR 61,000)
p3.2xlarge	V100	$3.06	$0.918	70%	$661 (PKR 185,000)

GCP Preemptible/Spot VMs

Instance	GPU	On-Demand/hr	Spot/hr	Savings	Monthly Spot (720 hrs)
n1 + T4	T4	$0.35	$0.11	69%	$79 (PKR 22,000)
g2-standard-4	L4	$0.74	$0.22	70%	$158 (PKR 44,000)
a2-highgpu-1g	A100	$3.67	$1.10	70%	$792 (PKR 222,000)
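To see how the Savings and Monthly Spot columns are derived, here is a minimal sketch. It assumes a 720-hour month (30 days x 24 hours), which is what the monthly figures in the tables work out to. The function name `spot_summary` is just an illustration, not part of any provider SDK.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours of continuous use

def spot_summary(on_demand_hr, spot_hr):
    """Return (savings percent, monthly spot cost in USD), both rounded."""
    savings_pct = round((1 - spot_hr / on_demand_hr) * 100)
    monthly_spot = round(spot_hr * HOURS_PER_MONTH)
    return savings_pct, monthly_spot

# Hourly rates taken from the tables above
print("g4dn.xlarge:", spot_summary(0.526, 0.158))
print("n1 + T4:    ", spot_summary(0.35, 0.11))
```

Spot prices change over time, so plug in current rates from your provider's pricing page before budgeting.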

Which AI Workloads Can Use Spot?

Workload	Spot Safe?	Why
Model training	Yes (with checkpointing)	Can resume from last checkpoint
Batch inference	Yes	Reprocess failed items
Fine-tuning	Yes (with checkpointing)	Same as training
Data preprocessing	Yes	Stateless, retryable
Dev/testing	Yes	No production impact
Real-time API	Risky	Interruption = downtime
Low-latency API	No	Need guaranteed availability

The Golden Rule

Use spot for anything that can be checkpointed and resumed or retried without loss. Use on-demand or dedicated for anything that must never go down.
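For retryable work such as batch inference, "retried without loss" usually means recording which items are finished so a restarted job skips them. The sketch below shows one possible way to do that with a small JSON progress file. The names `run_batch`, `process`, and `done_path` are hypothetical, and on a real spot instance the progress file would need to live on storage that survives the instance.

```python
import json
from pathlib import Path

def run_batch(items, process, done_path):
    """Process (item_id, payload) pairs, skipping ids already recorded as done."""
    done_file = Path(done_path)
    done = set(json.loads(done_file.read_text())) if done_file.exists() else set()
    for item_id, payload in items:
        if item_id in done:
            continue  # finished before the last interruption
        process(payload)
        done.add(item_id)
        # Record progress after every item so an interruption loses at most one item
        done_file.write_text(json.dumps(sorted(done)))
    return done
```

If the instance is reclaimed halfway through, rerunning `run_batch` with the same `done_path` picks up at the first unfinished item.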

Handling Spot Interruptions

Checkpointing for Training

Save model state every N steps so interruption only loses a few minutes of work:

python

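# PyTorch checkpoint saving and loading
import os
import torch

def save_checkpoint(model, optimizer, epoch, step, loss, path):
    """Save everything needed to resume training from this step."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': float(loss),  # store a plain number, not a tensor on the GPU
    }, path)

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; return the epoch and step to resume from."""
    # map_location lets a checkpoint saved on a GPU load on any device
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['step']

# Save every 500 steps. model, optimizer, dataloader, epoch, and train_step
# are assumed to be defined by your training script.
for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)
    if step > 0 and step % 500 == 0:
        save_checkpoint(model, optimizer, epoch, step, loss,
                        f"checkpoints/step_{step}.pt")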
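Two details matter when you resume on a fresh instance: finding the newest checkpoint, and making sure an interruption in the middle of a save does not leave a corrupt file. Here is a minimal sketch using only the standard library. It assumes the `step_N.pt` naming used above, and the helper names `latest_checkpoint` and `atomic_write` are illustrative.

```python
import os
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"step_(\d+)\.pt$")

def latest_checkpoint(directory):
    """Return the path of the highest-numbered step_N.pt file, or None."""
    best_step, best_path = -1, None
    for path in Path(directory).glob("step_*.pt"):
        match = STEP_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

def atomic_write(path, data: bytes):
    """Write to a temporary file, then rename it into place."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before renaming
    os.replace(tmp_path, path)  # readers see the old file or the new one, never half
```

With PyTorch you could apply the same idea by calling `torch.save(state, tmp_path)` and then `os.replace(tmp_path, path)`.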

AWS Spot Interruption Handler

AWS gives a 2-minute warning before terminating:

python

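import requests

METADATA_URL = "http://169.254.169.254/latest"

def check_spot_interruption():
    """Check if AWS is about to reclaim this instance."""
    try:
        # IMDSv2 needs a session token; instances that require it reject plain GETs
        token = requests.put(
            f"{METADATA_URL}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        ).text
        r = requests.get(
            f"{METADATA_URL}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=1,
        )
        # 200 means an interruption notice exists; 404 means none is scheduled
        return r.status_code == 200
    except requests.RequestException:
        return False  # metadata service unreachable (for example, running locally)

# In your training loop
for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)

    # Poll after the step so `loss` is always defined when we save
    if step % 10 == 0 and check_spot_interruption():
        print("SPOT INTERRUPTION DETECTED, saving checkpoint...")
        save_checkpoint(model, optimizer, epoch, step, loss,
                        "checkpoints/emergency.pt")
        break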
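When a notice exists, the `spot/instance-action` endpoint returns a small JSON body with an `action` and a UTC `time`. The sketch below parses that body to work out how many seconds are left, which helps you decide whether a full checkpoint will fit. The function name and the sample timestamp are made up for illustration.

```python
import json
from datetime import datetime, timezone

def seconds_until_interruption(body, now=None):
    """Parse an instance-action JSON body; return (action, seconds remaining)."""
    notice = json.loads(body)
    action_time = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    action_time = action_time.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (action_time - now).total_seconds()

# Hypothetical notice received two minutes before termination
sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
fake_now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, fake_now))
```

In the handler above you would pass `r.text` as `body` when the status code is 200.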

GCP Preemptible Shutdown Script

bash

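#!/bin/bash
# Shutdown script for a GCP preemptible/Spot VM.
# Register it through the instance's `shutdown-script` metadata key.
# On preemption GCP gives the VM about 30 seconds to run it, so keep it fast.

echo "Preemptible VM shutting down, saving state..."
python3 /app/save_emergency_checkpoint.py
# Upload to Cloud Storage; only report success if the copy worked
gsutil cp /checkpoints/emergency.pt gs://my-bucket/checkpoints/ \
    && echo "Checkpoint saved to Cloud Storage."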
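A shutdown script runs outside your training process, so the training loop itself also needs a way to notice the shutdown. One common approach is a SIGTERM handler that sets a flag. This is a sketch that assumes your training process receives SIGTERM when the guest OS starts shutting down, which is typical for Linux services but worth verifying on your own setup. `ShutdownFlag` and `run_until_shutdown` are hypothetical names.

```python
import signal

class ShutdownFlag:
    """Becomes True once the process receives SIGTERM."""
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny; the loop does the actual saving
        self.requested = True

def run_until_shutdown(total_steps, do_step, on_shutdown, flag):
    """Run do_step(step) until finished or until a shutdown is requested."""
    for step in range(total_steps):
        if flag.requested:
            on_shutdown(step)  # for example, save an emergency checkpoint
            return step
        do_step(step)
    return total_steps
```

With only about 30 seconds available, `on_shutdown` should write to local disk or an already-mounted bucket rather than start a large upload from scratch.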

Budget Strategies

Strategy 1: Spot Fleet (AWS)

Request multiple instance types — AWS picks the cheapest available:

json

{
  "SpotFleetRequestConfig": {
    "TargetCapacity": 4,
    "LaunchSpecifications": [
      {"InstanceType": "g4dn.xlarge"},
      {"InstanceType": "g4dn.2xlarge"},
      {"InstanceType": "g5.xlarge"}
    ],
    "AllocationStrategy": "lowestPrice"
  }
}

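With lowestPrice, AWS launches instances from the cheapest pool that currently has capacity. The cheapest pool is often the most contended, so AWS recommends the priceCapacityOptimized strategy when interruptions are costly. The JSON above is abbreviated: a complete request also needs fields such as IamFleetRole and an AMI ID in each launch specification.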
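Conceptually, a lowest-price allocation boils down to "filter to pools with capacity, then take the minimum price". The sketch below illustrates that idea only; it is not how AWS implements Spot Fleet, and the prices and capacity flags are invented for the example.

```python
def pick_lowest_price(pools):
    """pools: list of (instance_type, spot_price_per_hour, has_capacity)."""
    available = [pool for pool in pools if pool[2]]
    if not available:
        return None  # nothing can be launched right now
    return min(available, key=lambda pool: pool[1])[0]

# Hypothetical snapshot of three spot pools
pools = [
    ("g4dn.xlarge", 0.158, False),   # cheapest, but no spare capacity
    ("g4dn.2xlarge", 0.226, True),
    ("g5.xlarge", 0.302, True),
]
print(pick_lowest_price(pools))
```

Listing several instance types is what makes this useful: when the cheapest pool is empty, the request falls through to the next one instead of failing.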

Strategy 2: Time-of-Day Optimization

Spot prices and availability in US regions tend to follow demand through the day:

code

US business hours (9 AM - 5 PM EST): High demand → Higher spot prices
US night / weekend: Low demand → Lowest spot prices
Pakistan daytime = US nighttime = CHEAPEST GPU prices

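Schedule training jobs in US regions to run during Pakistan daytime (US nighttime), when demand tends to be lowest. The size of the effect varies by region and instance type, so check the spot price history for your instance before relying on it.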
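To check whether a planned start time in Pakistan overlaps US business hours, you can convert it with the standard library. This sketch uses fixed UTC offsets and, as a simplifying assumption, ignores US daylight saving time (EDT is UTC-4 for part of the year). The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

PKT = timezone(timedelta(hours=5), "PKT")    # Pakistan does not use daylight saving
EST = timezone(timedelta(hours=-5), "EST")   # assumption: ignores EDT (UTC-4)

def is_us_business_hours(pkt_time):
    """True if a Pakistan-time datetime is a US weekday between 9 AM and 5 PM EST."""
    us_time = pkt_time.astimezone(EST)
    return us_time.weekday() < 5 and 9 <= us_time.hour < 17

noon_pkt = datetime(2024, 1, 10, 12, 0, tzinfo=PKT)     # Wednesday 12:00 in Karachi
evening_pkt = datetime(2024, 1, 10, 21, 0, tzinfo=PKT)  # Wednesday 21:00 in Karachi
print(is_us_business_hours(noon_pkt))      # 02:00 EST
print(is_us_business_hours(evening_pkt))   # 11:00 EST
```

A job that starts at noon in Karachi begins in the middle of the US night, while one that starts at 9 PM lands in the US late morning.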

Strategy 3: Budget Alerts

Never get surprised by a cloud bill:

bash

# AWS Budget Alert (CLI)
aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{
        "BudgetName": "GPU-Monthly",
        "BudgetLimit": {"Amount": "200", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }' \
    --notifications-with-subscribers '[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80
        },
        "Subscribers": [{
            "SubscriptionType": "EMAIL",
            "Address": "your@email.com"
        }]
    }]'
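The notification above fires when actual spend goes above 80% of the $200 limit, which is $160. If you want the same check in your own scripts or dashboards, it is a one-line comparison. This is a sketch with a hypothetical function name, not a call into the AWS API.

```python
def budget_alert_triggered(actual_spend, budget_limit, threshold_pct=80):
    """Mirror an ACTUAL / GREATER_THAN notification with a percentage threshold."""
    return actual_spend > budget_limit * threshold_pct / 100

print(budget_alert_triggered(150, 200))  # below the $160 trigger point
print(budget_alert_triggered(161, 200))  # above it
```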

Strategy 4: Reserved/Committed Use Discounts

If you know you'll use a GPU for 1+ year:

Provider	Commitment	Discount	Best For
AWS Reserved	1 year	40%	Predictable production
AWS Reserved	3 year	60%	Long-term projects
GCP Committed	1 year	37%	Steady workloads
GCP Committed	3 year	55%	Established products

Strategy 5: The Hybrid Stack

code

Training (batch):      Spot instances (70% off)
Fine-tuning (batch):   Spot instances (70% off)
Production API:        Hetzner dedicated (fixed PKR 40K/month)
Traffic spikes:        Cloud on-demand (pay per burst)
Development:           Local GPU (PKR 5K electricity)

Cost Monitoring Tools

Tool	Cost	What It Does
AWS Cost Explorer	Free	Visualize AWS spending by service
GCP Billing Console	Free	Same for GCP
Infracost	Free (OSS)	Estimate cost of Terraform changes before deploying
Kubecost	Free tier	K8s cost monitoring per namespace/pod
Custom dashboard	Free	Prometheus + Grafana with billing metrics

Practice Lab

Task 1: Spot vs. On-Demand Calculator
Calculate the monthly cost difference between on-demand and spot for your specific workload on AWS and GCP. Include the cost of lost work from 2 spot interruptions per day (assuming 500-step checkpointing).

Task 2: Checkpointing System
Implement checkpoint saving/loading for a PyTorch training loop. Add spot interruption detection (use a mock for local testing). Verify you can stop training and resume from the last checkpoint.

Task 3: Budget Alert Setup
Set up a budget alert on AWS or GCP (free tier) that emails you when spending exceeds $50/month. Configure a second alert at 80% of your budget.

Pakistan Case Study

Meet Sana — a data scientist at a Karachi AI company training custom Urdu language models.

Her training cost problem:

Fine-tuning a 13B model on AWS p3.2xlarge (V100): $3.06/hour
Training takes ~72 hours = $220 per training run
She runs 3-4 experiments per week = $660-880/week = PKR 185,000-246,000/week

Her spot instance strategy:

Switched to spot instances: $0.92/hour (70% savings)
Added checkpointing every 200 steps (saves to S3)
Average of 1 spot interruption per training run (loses ~10 minutes of work)
Trained during Pakistan daytime (US nighttime = lowest spot prices)

Results:

Training cost per run: $220 → $66 (70% savings)
Weekly training cost: $880 → $264 (PKR 74,000)
Monthly savings: PKR 688,000
The 10 minutes lost per interruption is trivial — checkpoint resume takes 2 minutes
She now runs 2x more experiments with the same budget
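You can reproduce Sana's numbers with a few lines of arithmetic. This sketch assumes 4 runs per week, 4 weeks per month, and an exchange rate of roughly PKR 280 per USD, which is the rate the PKR figures in this lesson appear to use.

```python
PKR_PER_USD = 280     # assumption: approximate rate implied by the lesson's figures
HOURS_PER_RUN = 72
RUNS_PER_WEEK = 4     # upper end of the 3-4 experiments per week
WEEKS_PER_MONTH = 4   # assumption: simple 4-week month

def run_cost(hourly_rate):
    """Cost in USD of one 72-hour training run."""
    return hourly_rate * HOURS_PER_RUN

on_demand_run = run_cost(3.06)   # p3.2xlarge on-demand
spot_run = run_cost(0.918)       # p3.2xlarge spot
weekly_savings_usd = (on_demand_run - spot_run) * RUNS_PER_WEEK
monthly_savings_pkr = weekly_savings_usd * WEEKS_PER_MONTH * PKR_PER_USD

print(f"Per run: ${on_demand_run:.0f} on-demand vs ${spot_run:.0f} spot")
print(f"Monthly savings: about PKR {monthly_savings_pkr:,.0f}")
```

The result lands within about 1% of the PKR 688,000 quoted above; the small gap comes from the case study rounding the per-run and weekly costs before converting.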

Key Takeaways

Spot/preemptible instances save 60-90% on GPU costs, making them the single biggest optimization
Only use spot for interruptible workloads: training, fine-tuning, batch inference
Always checkpoint during training — every 200-500 steps minimum
Handle interruption gracefully: save checkpoint → upload to S3 → resume on new instance
Pakistan timezone advantage: train during local daytime = US nighttime = cheapest spot prices
Set budget alerts to avoid surprise bills — cloud spending grows silently
The hybrid stack (spot for training, dedicated for production, local for dev) minimizes total cost

Next lesson: Building a cost-optimized AI pipeline from end to end.
