8.2 — Spot Instances, Preemptible VMs & Budget Strategies
Cloud GPUs are expensive at on-demand rates. Providers, however, sell their unused capacity at a 60-90% discount if you accept that your instance may be interrupted at short notice. AWS calls these Spot Instances, GCP calls them preemptible VMs (now Spot VMs), and Azure calls them Spot VMs. For AI workloads that can tolerate interruption, this is the single biggest cost-saving strategy.
How Spot/Preemptible Pricing Works
On-Demand: "I need a GPU now, guaranteed, no interruption."
→ Full price: $1.00/hour
Spot: "I'll take whatever GPU capacity you have left over."
→ Discounted: $0.20-0.40/hour (60-80% off)
→ Risk: Instance can be reclaimed at short notice (2 min warning on AWS, 30 seconds on GCP)
Committed: "I'll pay monthly for 1-3 years, guaranteed."
→ Discounted: $0.60/hour (40% off)
→ Risk: You pay even if you don't use it
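To make the trade-off concrete, the sketch below computes one month's bill under each model using the illustrative rates above. The function name, the $0.30 spot rate (the midpoint of the quoted range), and the 720-hour month are assumptions for illustration, not provider figures.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours

def monthly_cost(hours_used, on_demand=1.00, spot=0.30, committed=0.60):
    """Return the monthly bill in USD under each pricing model."""
    return {
        "on_demand": round(hours_used * on_demand, 2),
        "spot": round(hours_used * spot, 2),
        # Committed capacity is billed for every hour of the month, used or not
        "committed": round(HOURS_PER_MONTH * committed, 2),
    }

print(monthly_cost(200))  # light usage: committed is the most expensive option
print(monthly_cost(720))  # round-the-clock usage: committed beats on-demand
```

At 200 hours of use the commitment costs more than on-demand, because the committed bill does not shrink with usage. At full utilization the ordering flips.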
Spot Instance Pricing Comparison
AWS Spot Instances
| Instance | GPU | On-Demand/hr | Spot/hr | Savings | Monthly Spot (720 hrs) |
|---|---|---|---|---|---|
| g4dn.xlarge | T4 | $0.526 | $0.158 | 70% | $114 (PKR 32,000) |
| g5.xlarge | A10G | $1.006 | $0.302 | 70% | $217 (PKR 61,000) |
| p3.2xlarge | V100 | $3.06 | $0.918 | 70% | $661 (PKR 185,000) |
GCP Preemptible/Spot VMs
| Instance | GPU | On-Demand/hr | Spot/hr | Savings | Monthly Spot (720 hrs) |
|---|---|---|---|---|---|
| n1 + T4 | T4 | $0.35 | $0.11 | 69% | $79 (PKR 22,000) |
| g2-standard-4 | L4 | $0.74 | $0.22 | 70% | $158 (PKR 44,000) |
| a2-highgpu-1g | A100 | $3.67 | $1.10 | 70% | $792 (PKR 222,000) |
Which AI Workloads Can Use Spot?
| Workload | Spot Safe? | Why |
|---|---|---|
| Model training | Yes (with checkpointing) | Can resume from last checkpoint |
| Batch inference | Yes | Reprocess failed items |
| Fine-tuning | Yes (with checkpointing) | Same as training |
| Data preprocessing | Yes | Stateless, retryable |
| Dev/testing | Yes | No production impact |
| Real-time API | Risky | Interruption = downtime |
| Low-latency API | No | Need guaranteed availability |
The Golden Rule
Use spot for anything that can be checkpointed and resumed or retried without loss. Use on-demand or dedicated for anything that must never go down.
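For retryable work such as batch inference, "retried without loss" usually means remembering which items are already finished. The following is a minimal sketch of that idea. `run_batch` and its arguments are hypothetical names, and in practice the `done` record would be persisted to durable storage rather than kept in memory.

```python
def run_batch(items, process, done=None):
    """Process items, skipping any already finished.

    Returns (done, failed): a dict of results and a list of items to retry.
    Items must be hashable so they can be used as dict keys.
    """
    done = dict(done or {})
    failed = []
    for item in items:
        if item in done:
            continue  # finished before the interruption, do not repeat
        try:
            done[item] = process(item)
        except Exception:
            failed.append(item)  # leave it for the next pass
    return done, failed
```

Calling `run_batch` again with the same items and the previous `done` dict reprocesses only the items that failed or were never reached.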
Handling Spot Interruptions
Checkpointing for Training
Save model state every N steps so interruption only loses a few minutes of work:
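# PyTorch checkpoint saving
import os
import torch

def save_checkpoint(model, optimizer, epoch, step, loss, path):
    """Save everything needed to resume training."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; return the (epoch, step) to resume from."""
    # map_location lets a checkpoint saved on one GPU load on any replacement instance
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['step']

# Save every 500 steps.
# model, optimizer, dataloader, epoch and train_step() come from your own training script.
for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)
    if step > 0 and step % 500 == 0:
        save_checkpoint(model, optimizer, epoch, step, loss,
                        f"checkpoints/step_{step}.pt")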
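Resuming needs two more pieces: finding the newest checkpoint on a fresh instance, and making sure an interruption in the middle of a save cannot leave a corrupt file behind. The sketch below shows one way to do both with the standard library, assuming the `step_N.pt` naming used above. The helper names are hypothetical.

```python
import os
import re

STEP_PATTERN = re.compile(r"step_(\d+)\.pt")

def latest_checkpoint(directory):
    """Return (step, path) for the highest-numbered step_N.pt file, or None."""
    if not os.path.isdir(directory):
        return None
    best = None
    for name in os.listdir(directory):
        match = STEP_PATTERN.fullmatch(name)
        if match:
            step = int(match.group(1))  # compare numerically, not as strings
            if best is None or step > best[0]:
                best = (step, os.path.join(directory, name))
    return best

def atomic_write(path, data):
    """Write bytes to a temp file, then rename it into place.

    If the instance dies mid-write, only the .tmp file is incomplete
    and the previous checkpoint at `path` is untouched.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
```

The same pattern should carry over to PyTorch: save to a temporary path first, then call `os.replace` to move it to the final name. The numeric comparison matters because a plain string sort would rank `step_500.pt` after `step_1500.pt`.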
AWS Spot Interruption Handler
AWS gives a 2-minute warning before terminating:
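import requests

IMDS = "http://169.254.169.254/latest"

def check_spot_interruption():
    """Return True if AWS has scheduled this Spot instance for interruption."""
    try:
        # IMDSv2 needs a session token; without it the request returns 401
        # on instances that enforce IMDSv2.
        token = requests.put(
            f"{IMDS}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        ).text
        r = requests.get(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=1,
        )
        # 200 means an interruption is scheduled; 404 means none is pending
        return r.status_code == 200
    except requests.RequestException:
        return False  # metadata service unreachable (e.g. not running on EC2)

# In your training loop
loss = None  # defined up front so an interruption at step 0 can still checkpoint
for step, batch in enumerate(dataloader):
    if step % 10 == 0 and check_spot_interruption():
        print("SPOT INTERRUPTION DETECTED: saving checkpoint...")
        save_checkpoint(model, optimizer, epoch, step, loss,
                        "checkpoints/emergency.pt")
        break
    loss = train_step(model, batch)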
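Polling inside the training loop ties detection latency to step time. One alternative, sketched here under that assumption, is to poll from a background thread and have the loop test a flag. Because the check function is injected, the same class works with a mock for local testing. `InterruptionWatcher` is a hypothetical helper, not part of any AWS SDK.

```python
import threading

class InterruptionWatcher:
    """Poll a check function in the background; set a flag when it returns True."""

    def __init__(self, check, interval=5.0):
        self._check = check          # callable returning True on interruption
        self._interval = interval    # seconds between polls
        self.interrupted = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            if self._check():
                self.interrupted.set()
                return
            self._stop.wait(self._interval)  # sleeps, but wakes early on stop()

# Demo with a mock check that reports an interruption on its third poll
calls = {"n": 0}

def mock_check():
    calls["n"] += 1
    return calls["n"] >= 3

watcher = InterruptionWatcher(mock_check, interval=0.01).start()
watcher.interrupted.wait(timeout=2)
watcher.stop()
print("interrupted:", watcher.interrupted.is_set())
```

In a training loop, the condition would become `if watcher.interrupted.is_set():` and the checkpoint-and-break logic stays the same. The check function should handle its own exceptions, since an uncaught error would end the polling thread.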
GCP Preemptible Shutdown Script
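#!/bin/bash
# /etc/init.d/preemptible-shutdown.sh
# Register this file through the instance's shutdown-script metadata key.
# GCP runs it when preemption starts, about 30 seconds before termination.
echo "Preemptible VM shutting down, saving state..."
python3 /app/save_emergency_checkpoint.py
# On GCP, copy to a Cloud Storage bucket (bucket name is a placeholder)
gsutil cp /checkpoints/emergency.pt gs://my-bucket/checkpoints/
echo "Checkpoint saved to Cloud Storage."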
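Thirty seconds is a tight budget, so it is worth estimating whether an emergency upload can finish in time. The sketch below does the arithmetic. The checkpoint sizes and the 10 Gbps bandwidth are hypothetical inputs, and the 50% safety margin is an arbitrary assumption that leaves time for the save itself.

```python
def upload_seconds(size_gb, bandwidth_gbps):
    """Seconds to upload size_gb gigabytes at bandwidth_gbps gigabits per second."""
    return size_gb * 8 / bandwidth_gbps

def fits_in_window(size_gb, bandwidth_gbps, window_s=30, safety=0.5):
    """True if the upload fits within a fraction (safety) of the shutdown window."""
    return upload_seconds(size_gb, bandwidth_gbps) <= window_s * safety

# Hypothetical examples at 10 Gbps sustained throughput
print(fits_in_window(2, 10))   # small checkpoint
print(fits_in_window(26, 10))  # roughly a 13B-parameter model's fp16 weights
```

When the estimate does not fit, the emergency save cannot be relied on, and the periodic checkpoints written during training become the real recovery mechanism.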
Budget Strategies
Strategy 1: Spot Fleet (AWS)
Request multiple instance types — AWS picks the cheapest available:
{
  "SpotFleetRequestConfig": {
    "TargetCapacity": 4,
    "LaunchSpecifications": [
      {"InstanceType": "g4dn.xlarge"},
      {"InstanceType": "g4dn.2xlarge"},
      {"InstanceType": "g5.xlarge"}
    ],
    "AllocationStrategy": "lowestPrice"
  }
}
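With lowestPrice, AWS fills the request from the cheapest of the listed instance pools that currently has capacity. The fragment shows only the fields relevant here; a real request also needs fields such as IamFleetRole and an ImageId in each launch specification.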
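The selection idea can be illustrated with a toy model: among the instance types you listed, take the cheapest one that has capacity. This is only a sketch of the concept, not AWS's actual allocation logic. The g4dn.xlarge and g5.xlarge prices are the spot rates from the table above, and the g4dn.2xlarge price is a made-up placeholder.

```python
def pick_lowest_price(pools):
    """Return the cheapest instance type that currently has capacity, or None."""
    candidates = [p for p in pools if p["available"]]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["price"])["type"]

pools = [
    {"type": "g4dn.xlarge", "price": 0.158, "available": False},  # no capacity now
    {"type": "g4dn.2xlarge", "price": 0.226, "available": True},  # hypothetical price
    {"type": "g5.xlarge", "price": 0.302, "available": True},
]
print(pick_lowest_price(pools))
```

Listing several instance types is what makes the fallback possible: with a single type, a capacity shortage in that one pool leaves the request unfulfilled.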
Strategy 2: Time-of-Day Optimization
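GPU spot prices and availability tend to follow regional demand, which varies by time of day:
US business hours (9 AM - 5 PM EST, i.e. 7 PM - 3 AM in Pakistan): High demand → Higher spot prices
US night / weekend: Low demand → Lowest spot prices
Pakistan daytime = US nighttime = CHEAPEST GPU prices
Schedule training jobs in US regions to run during Pakistan daytime (US nighttime) for the lowest spot costs. The size of the effect varies by region and instance type, so confirm it against the spot price history (for example with aws ec2 describe-spot-price-history) before relying on it.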
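The time-zone arithmetic behind this is easy to check. The sketch below converts US Eastern business hours to Pakistan time using fixed UTC offsets. It assumes Eastern Standard Time (UTC-5), so during US daylight saving time the window shifts one hour earlier in Pakistan.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")  # US Eastern Standard Time, no DST
PKT = timezone(timedelta(hours=5), "PKT")   # Pakistan Standard Time

def est_to_pkt(hour):
    """Convert an hour of the day in EST to the matching hour in PKT."""
    t = datetime(2024, 1, 15, hour, 0, tzinfo=EST)  # arbitrary winter date
    return t.astimezone(PKT).hour

print(est_to_pkt(9), est_to_pkt(17))  # prints: 19 3
```

US business hours therefore fall between 7 PM and 3 AM in Pakistan, which leaves the Pakistani working day inside the US off-peak window.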
Strategy 3: Budget Alerts
Never get surprised by a cloud bill:
# AWS Budget Alert (CLI)
# --account-id must be your 12-digit AWS account ID (placeholder shown)
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "GPU-Monthly",
    "BudgetLimit": {"Amount": "200", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{
      "SubscriptionType": "EMAIL",
      "Address": "your@email.com"
    }]
  }]'
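An alert on actual spend fires only after the money is gone, so it can help to project the month-end total from spend so far. The following is a minimal sketch of a linear projection against the $200 budget and 80% threshold used above. The function names are hypothetical, and a straight-line forecast is a simplifying assumption that ignores bursty usage.

```python
import calendar
from datetime import date

def projected_spend(spend_to_date, today):
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def alert_needed(spend_to_date, today, budget=200.0, threshold_pct=80):
    """True if projected spend reaches the alert threshold."""
    return projected_spend(spend_to_date, today) >= budget * threshold_pct / 100

# $60 spent by 10 June projects to $180, above the $160 threshold
print(projected_spend(60, date(2024, 6, 10)))
print(alert_needed(60, date(2024, 6, 10)))
```

AWS Budgets offers a similar built-in option: setting NotificationType to FORECASTED instead of ACTUAL alerts on projected rather than incurred spend.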
Strategy 4: Reserved/Committed Use Discounts
If you know you'll use a GPU for 1+ year:
| Provider | Commitment | Discount | Best For |
|---|---|---|---|
| AWS Reserved | 1 year | 40% | Predictable production |
| AWS Reserved | 3 year | 60% | Long-term projects |
| GCP Committed | 1 year | 37% | Steady workloads |
| GCP Committed | 3 year | 55% | Established products |
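Whether a commitment pays off depends on utilization. A commitment at discount d is billed for all H hours of the month, while on-demand is billed only for the h hours actually used, so the two cost the same when h = H × (1 − d). The sketch below applies that to the discounts in the table, assuming a 720-hour month.

```python
def break_even_hours(discount_pct, hours_per_month=720):
    """Monthly usage hours above which a commitment is cheaper than on-demand."""
    return round(hours_per_month * (1 - discount_pct / 100), 1)

for label, discount in [("AWS 1 year", 40), ("AWS 3 year", 60),
                        ("GCP 1 year", 37), ("GCP 3 year", 55)]:
    hours = break_even_hours(discount)
    print(f"{label}: break-even at {hours} hours/month "
          f"({hours / 720:.0%} utilization)")
```

A 40% discount, for example, only wins if the GPU is busy more than 60% of the time. Below that, on-demand or spot is the cheaper choice.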
Strategy 5: The Hybrid Stack
Training (batch): Spot instances (70% off)
Fine-tuning (batch): Spot instances (70% off)
Production API: Hetzner dedicated (fixed PKR 40K/month)
Traffic spikes: Cloud on-demand (pay per burst)
Development: Local GPU (PKR 5K electricity)
Cost Monitoring Tools
| Tool | Cost | What It Does |
|---|---|---|
| AWS Cost Explorer | Free | Visualize AWS spending by service |
| GCP Billing Console | Free | Same for GCP |
| Infracost | Free (OSS) | Estimate cost of Terraform changes before deploying |
| Kubecost | Free tier | K8s cost monitoring per namespace/pod |
| Custom dashboard | Free | Prometheus + Grafana with billing metrics |
Practice Lab
Task 1: Spot vs. On-Demand Calculator. Calculate the monthly cost difference between on-demand and spot for your specific workload on AWS and GCP. Include the cost of lost work from 2 spot interruptions per day (assuming 500-step checkpointing).
Task 2: Checkpointing System. Implement checkpoint saving/loading for a PyTorch training loop. Add spot interruption detection (use a mock for local testing). Verify you can stop training and resume from the last checkpoint.
Task 3: Budget Alert Setup. Set up a budget alert on AWS or GCP (free tier) that emails you when spending exceeds $50/month. Configure a second alert at 80% of your budget.
Pakistan Case Study
Meet Sana — a data scientist at a Karachi AI company training custom Urdu language models.
Her training cost problem:
- Fine-tuning a 13B model on AWS p3.2xlarge (V100): $3.06/hour
- Training takes ~72 hours = $220 per training run
- She runs 3-4 experiments per week = $660-880/week = PKR 185,000-246,000/week
Her spot instance strategy:
- Switched to spot instances: $0.92/hour (70% savings)
- Added checkpointing every 200 steps (saves to S3)
- Average of 1 spot interruption per training run (loses ~10 minutes of work)
- Trained during Pakistan daytime (US nighttime = lowest spot prices)
Results:
- Training cost per run: $220 → $66 (70% savings)
- Weekly training cost: $880 → $264 (PKR 74,000)
- Monthly savings: PKR 688,000
- The 10 minutes lost per interruption is trivial — checkpoint resume takes 2 minutes
- She now runs 2x more experiments with the same budget
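Sana's per-run figures can be reproduced with a few lines of arithmetic. The sketch below uses the rates and durations from the case study, and additionally charges the time lost to one interruption (10 minutes of redone work plus a 2-minute resume). The function name is hypothetical.

```python
HOURS = 72                     # training time per run
ON_DEMAND, SPOT = 3.06, 0.92   # USD per hour for p3.2xlarge, from the case study

def run_cost(rate, hours=HOURS, interruptions=0, lost_min=10, resume_min=2):
    """Cost of one run, including time repeated after each interruption."""
    overhead_hours = interruptions * (lost_min + resume_min) / 60
    return round((hours + overhead_hours) * rate, 2)

on_demand = run_cost(ON_DEMAND)
spot = run_cost(SPOT, interruptions=1)
print(on_demand, spot)  # prints: 220.32 66.42
print(f"savings: {1 - spot / on_demand:.0%}")
```

One interruption adds about 18 cents to a run, which is why the overhead barely moves the 70% savings figure.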
Key Takeaways
- Spot/preemptible instances save 60-90% on GPU costs, making them the single biggest optimization
- Only use spot for interruptible workloads: training, fine-tuning, batch inference
- Always checkpoint during training, at least once every 200-500 steps
- Handle interruption gracefully: save checkpoint → upload to S3 → resume on new instance
- Pakistan timezone advantage: train during local daytime = US nighttime = cheapest spot prices
- Set budget alerts to avoid surprise bills — cloud spending grows silently
- The hybrid stack (spot for training, dedicated for production, local for dev) minimizes total cost
Next lesson: Building a cost-optimized AI pipeline from end to end.