6.3 — GPU Scheduling & Resource Management
GPUs are the most expensive resource in any AI infrastructure. An NVIDIA A100 costs $2-3/hour on cloud. If your Kubernetes cluster has 4 GPUs and 10 AI models competing for them, who gets what? Poor scheduling means GPUs sit idle while pods queue, or one greedy model starves everything else. This lesson teaches you to manage GPU resources efficiently in containerized environments.
The GPU Scheduling Problem
Without GPU scheduling:
├── Model A grabs GPU 0 and GPU 1 (needs only 1)
├── Model B grabs GPU 2 (correct)
├── Model C → no GPU available → waiting...
├── Model D → no GPU available → waiting...
└── GPU 1 is 90% idle (wasted $2/hour)
With proper scheduling:
├── Model A → GPU 0 (limit: 1 GPU)
├── Model B → GPU 1 (limit: 1 GPU)
├── Model C → GPU 2 (limit: 1 GPU)
├── Model D → GPU 3 (limit: 1 GPU)
└── All GPUs utilized, all models served
NVIDIA Device Plugin for Kubernetes
The NVIDIA device plugin lets Kubernetes see and schedule GPUs:
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPUs are detected
kubectl describe nodes | grep nvidia.com/gpu
# Output: nvidia.com/gpu: 4 (shows 4 GPUs available)
Requesting GPUs in Pod Specs
spec:
  containers:
  - name: llm-inference
    image: my-llm-api:v1
    resources:
      requests:
        nvidia.com/gpu: 1  # Request 1 GPU
      limits:
        nvidia.com/gpu: 1  # Limit to 1 GPU
Rules:
- GPU requests and limits must be equal (no overcommit)
- GPUs are whole numbers only (can't request 0.5 GPU)
- A pod with nvidia.com/gpu: 1 gets exclusive access to one full GPU
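Putting these rules together, a complete pod manifest looks like this (the image name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: llm-inference
    image: my-llm-api:v1     # illustrative image name
    resources:
      requests:
        nvidia.com/gpu: 1    # requests and limits must match
      limits:
        nvidia.com/gpu: 1
```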
GPU Resource Planning
Sizing Your Models
| Model Size | VRAM Needed | GPU Required | PKR Cloud Cost/hour |
|---|---|---|---|
| 7B (Q4) | 4-6 GB | RTX 3060 / T4 | PKR 150-250 |
| 7B (FP16) | 14 GB | RTX 4090 / A10G | PKR 300-500 |
| 13B (Q4) | 8-10 GB | RTX 4070 / T4 | PKR 200-350 |
| 70B (Q4) | 36-40 GB | A100 40GB | PKR 700-1,000 |
| 70B (FP16) | 140 GB | 2x A100 80GB | PKR 1,500-2,000 |
The VRAM Budget
Total GPU VRAM: 24 GB (RTX 4090)
├── Model weights: 14 GB (7B FP16)
├── KV cache: 4 GB (for context window)
├── Activation memory: 2 GB (during inference)
├── OS/driver overhead: 1 GB
└── Available: 3 GB buffer
Rule: Keep 10-15% VRAM free as buffer
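The budget above can be sketched as a back-of-the-envelope calculator. The layer and head dimensions below are assumptions for a Llama-7B-class model, not measured values, and the fixed overhead stands in for the activation and driver lines of the budget:

```python
def vram_estimate_gb(params_b, dtype_bytes, n_layers, n_kv_heads,
                     head_dim, context_len, batch=1, overhead_gb=3.0):
    """Rough VRAM budget: weights + KV cache + a fixed overhead
    covering activations and OS/driver (assumed ~3 GB)."""
    weights_gb = params_b * 1e9 * dtype_bytes / 1024**3
    # KV cache: 2 tensors (K and V) per layer, per cached token
    kv_gb = (2 * n_layers * context_len * n_kv_heads * head_dim
             * dtype_bytes * batch) / 1024**3
    return weights_gb + kv_gb + overhead_gb

# 7B model in FP16 with a 4K context window (Llama-7B-like dims assumed)
total = vram_estimate_gb(params_b=7, dtype_bytes=2, n_layers=32,
                         n_kv_heads=32, head_dim=128, context_len=4096)
# ≈ 18 GB — fits a 24 GB RTX 4090 with the 10-15% buffer intact
```

Doubling the context length roughly doubles the KV-cache term, which is why long-context serving eats the buffer first.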
Multi-GPU Strategies
Strategy 1: One Model Per GPU
Simplest approach — each model gets its own GPU:
# Model A on GPU 0
spec:
  nodeSelector:
    gpu-type: "a100"
  containers:
  - name: model-a
    resources:
      limits:
        nvidia.com/gpu: 1
---
# Model B on GPU 1
spec:
  containers:
  - name: model-b
    resources:
      limits:
        nvidia.com/gpu: 1
Strategy 2: GPU Sharing (MIG / MPS)
NVIDIA Multi-Instance GPU (MIG) partitions one A100 into up to seven hardware-isolated slices:
A100 80GB with MIG:
├── Slice 1: 2g.20gb → Small model API
├── Slice 2: 2g.20gb → Embedding service
├── Slice 3: 2g.20gb → Image classifier
└── Slice 4: 1g.10gb → Development/testing
MIG setup:
# Enable MIG mode on GPU 0 (requires a GPU reset — drain workloads first)
sudo nvidia-smi -i 0 -mig 1
# Create three 2g.20gb slices and one 1g.10gb slice (profile IDs 14 and 19)
sudo nvidia-smi mig -i 0 -cgi 14,14,14,19 -C
# Verify the created MIG devices
nvidia-smi -L
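Once MIG is enabled and the device plugin runs with its mixed MIG strategy, a pod can request a specific slice instead of a whole GPU — the resource name follows the plugin's nvidia.com/mig-&lt;profile&gt; convention:

```yaml
resources:
  limits:
    nvidia.com/mig-2g.20gb: 1  # one 20GB MIG slice, not a full A100
```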
Strategy 3: Time-Sharing (NVIDIA MPS)
Multiple processes share one GPU by time-slicing:
# Enable MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log
nvidia-cuda-mps-control -d
When to use which:
| Strategy | Best For | Isolation | Efficiency |
|---|---|---|---|
| One-per-GPU | Production inference | Full | Good |
| MIG | Multiple small models on A100 | Hardware-level | Excellent |
| MPS | Dev/test environments | Process-level | Good |
| No sharing | Training (needs full GPU) | Full | Varies |
Node Affinity & Taints
Labeling GPU Nodes
# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4
kubectl label nodes cpu-node-1 role=cpu-only
Scheduling Models to Specific GPUs
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
This ensures your 70B model only runs on A100 nodes, not T4 nodes where it would OOM.
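For a single exact-match label like this, the shorter nodeSelector field expresses the same constraint:

```yaml
spec:
  nodeSelector:
    gpu-type: a100
```

Full nodeAffinity is only needed for operators like In with multiple values, NotIn, or Exists.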
Taints: Reserving GPU Nodes
# Taint GPU nodes — only GPU workloads can use them
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# CPU-only pods won't be scheduled on GPU nodes
# GPU pods need a toleration:
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
Monitoring GPU Utilization
DCGM Exporter + Prometheus
# Add the NVIDIA repo, then deploy the DCGM exporter for GPU metrics
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
# Key metrics to watch:
# DCGM_FI_DEV_GPU_UTIL — GPU utilization %
# DCGM_FI_DEV_FB_USED — VRAM used (bytes)
# DCGM_FI_DEV_GPU_TEMP — Temperature
# DCGM_FI_DEV_POWER_USAGE — Power draw (watts)
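As a sketch, these metrics can drive a Prometheus alert that flags idle, paid-for GPUs; the threshold and windows below are illustrative, not recommendations:

```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUUnderutilized
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} under 20% utilization for 1h+ — idle spend"
```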
Quick CLI Monitoring
# Inside a GPU pod
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 5
# Output every 5 seconds:
# utilization.gpu [%], memory.used [MiB], memory.total [MiB], temperature.gpu
# 78 %, 12400 MiB, 24576 MiB, 72
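The CSV output is easy to post-process in a monitoring script; a minimal parser, with the sample row above hard-coded for illustration:

```python
def parse_gpu_csv(line):
    """Parse one data row of nvidia-smi --format=csv output into numbers."""
    util, used, total, temp = [field.strip() for field in line.split(",")]
    return {
        "util_pct": int(util.rstrip(" %")),       # "78 %"      -> 78
        "mem_used_mib": int(used.rstrip(" MiB")),  # "12400 MiB" -> 12400
        "mem_total_mib": int(total.rstrip(" MiB")),
        "temp_c": int(temp),
    }

stats = parse_gpu_csv("78 %, 12400 MiB, 24576 MiB, 72")
mem_pct = 100 * stats["mem_used_mib"] / stats["mem_total_mib"]  # ~50% VRAM used
```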
Practice Lab
Task 1: GPU Node Setup
In your Kubernetes cluster, install the NVIDIA device plugin. Label your GPU nodes by GPU type. Verify that kubectl describe nodes shows GPU resources.
Task 2: Multi-Model Deployment
Deploy 2 different models on a multi-GPU node. Use resource limits to ensure each model gets exactly 1 GPU. Verify with nvidia-smi that both GPUs are allocated.
Task 3: Monitoring Dashboard
Set up DCGM exporter and query GPU metrics. Create a simple dashboard showing GPU utilization, VRAM usage, and temperature for each GPU in your cluster.
Pakistan Case Study
Meet Asim — infrastructure lead at a Karachi fintech running 3 AI models: fraud detection, credit scoring, and Urdu NLP chatbot.
His problem: 2 A100 GPUs (PKR 200,000/month cloud bill). All 3 models competing for GPU time. Fraud detection (latency-critical) was getting delayed by the chatbot (batch-tolerant).
His solution:
- GPU 0: Fraud detection (dedicated, tainted — nothing else runs here)
- GPU 1 with MIG: Credit scoring (40GB slice) + Chatbot (40GB slice)
- Node affinity ensures fraud model never shares GPU
- Priority classes: fraud > credit > chatbot
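The priority ordering can be expressed with Kubernetes PriorityClass objects; the names and values here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: fraud-critical
value: 1000000        # highest — latency-critical fraud detection
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-tolerant
value: 1000           # lowest — the chatbot yields under pressure
```

Each pod then references its class via priorityClassName in its spec.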
Results:
- Fraud detection latency: 200ms → 45ms (dedicated GPU)
- All 3 models running simultaneously (was previously sequential)
- GPU utilization: 35% → 82% (stopped paying for idle compute)
- Monthly savings: PKR 60,000 (removed the 3rd GPU they thought they needed)
Key Takeaways
- GPUs are the most expensive resource — scheduling them well saves serious money
- NVIDIA device plugin makes GPUs visible to Kubernetes scheduler
- GPU requests and limits must be equal — no fractional or overcommitted GPUs
- MIG splits A100s into isolated slices for multi-model serving
- Node affinity and taints ensure the right model runs on the right GPU
- Monitor GPU utilization with DCGM — target 70-85% for cost efficiency
Next lesson: FastAPI for model serving — building production AI endpoints.
Quiz: GPU Scheduling & Resource Management
4 questions to test your understanding. Score 60% or higher to pass.