6.3 — GPU Scheduling & Resource Management
GPUs are the most expensive resource in any AI infrastructure. An NVIDIA A100 costs $2-3/hour on cloud. If your Kubernetes cluster has 4 GPUs and 10 AI models competing for them, who gets what? Poor scheduling means GPUs sit idle while pods queue, or one greedy model starves everything else. This lesson teaches you to manage GPU resources efficiently in containerized environments.
The GPU Scheduling Problem
Without GPU scheduling:
├── Model A grabs GPU 0 and GPU 1 (needs only 1)
├── Model B grabs GPU 2 (correct)
├── Model C → no GPU available → waiting...
├── Model D → no GPU available → waiting...
└── GPU 1 is 90% idle (wasted $2/hour)
With proper scheduling:
├── Model A → GPU 0 (limit: 1 GPU)
├── Model B → GPU 1 (limit: 1 GPU)
├── Model C → GPU 2 (limit: 1 GPU)
├── Model D → GPU 3 (limit: 1 GPU)
└── All GPUs utilized, all models served
NVIDIA Device Plugin for Kubernetes
The NVIDIA device plugin lets Kubernetes see and schedule GPUs:
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPUs are detected
kubectl describe nodes | grep nvidia.com/gpu
# Output: nvidia.com/gpu: 4 (shows 4 GPUs available)
Requesting GPUs in Pod Specs
spec:
  containers:
  - name: llm-inference
    image: my-llm-api:v1
    resources:
      requests:
        nvidia.com/gpu: 1  # Request 1 GPU
      limits:
        nvidia.com/gpu: 1  # Limit to 1 GPU
Rules:
- GPU requests and limits must be equal (no overcommit)
- GPUs are whole numbers only (can't request 0.5 GPU)
- A pod with nvidia.com/gpu: 1 gets exclusive access to one full GPU
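Putting these rules together, a complete pod manifest looks like this (the image name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: llm-inference
    image: my-llm-api:v1     # illustrative image name
    resources:
      requests:
        nvidia.com/gpu: 1    # requests and limits must match
      limits:
        nvidia.com/gpu: 1
```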
GPU Resource Planning
Sizing Your Models
| Model Size | VRAM Needed | GPU Required | PKR Cloud Cost/hour |
|---|---|---|---|
| 7B (Q4) | 4-6 GB | RTX 3060 / T4 | PKR 150-250 |
| 7B (FP16) | 14 GB | RTX 4090 / A10G | PKR 300-500 |
| 13B (Q4) | 8-10 GB | RTX 4070 / T4 | PKR 200-350 |
| 70B (Q4) | 36-40 GB | A100 40GB | PKR 700-1,000 |
| 70B (FP16) | 140 GB | 2x A100 80GB | PKR 1,500-2,000 |
The VRAM Budget
Total GPU VRAM: 24 GB (RTX 4090)
├── Model weights: 14 GB (7B FP16)
├── KV cache: 4 GB (for context window)
├── Activation memory: 2 GB (during inference)
├── OS/driver overhead: 1 GB
└── Available: 3 GB buffer
Rule: Keep 10-15% VRAM free as buffer
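The budget above can be sketched as a back-of-the-envelope calculator. The layer and head dimensions below are assumptions for a Llama-7B-class model, not measured values, and the fixed overhead stands in for the activation and driver lines of the budget:

```python
def vram_estimate_gb(params_b, dtype_bytes, n_layers, n_kv_heads,
                     head_dim, context_len, batch=1, overhead_gb=3.0):
    """Rough VRAM budget: weights + KV cache + a fixed overhead
    covering activations and OS/driver (assumed ~3 GB)."""
    weights_gb = params_b * 1e9 * dtype_bytes / 1024**3
    # KV cache: 2 tensors (K and V) per layer, per cached token
    kv_gb = (2 * n_layers * context_len * n_kv_heads * head_dim
             * dtype_bytes * batch) / 1024**3
    return weights_gb + kv_gb + overhead_gb

# 7B model in FP16 with a 4K context window (Llama-7B-like dims assumed)
total = vram_estimate_gb(params_b=7, dtype_bytes=2, n_layers=32,
                         n_kv_heads=32, head_dim=128, context_len=4096)
# ≈ 18 GB — fits a 24 GB RTX 4090 with the 10-15% buffer intact
```

Doubling the context length roughly doubles the KV-cache term, which is why long-context serving eats the buffer first.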
Multi-GPU Strategies
Strategy 1: One Model Per GPU
Simplest approach — each model gets its own GPU:
# Model A on GPU 0
spec:
  nodeSelector:
    gpu-type: "a100"
  containers:
  - name: model-a
    resources:
      limits:
        nvidia.com/gpu: 1
---
# Model B on GPU 1
spec:
  containers:
  - name: model-b
    resources:
      limits:
        nvidia.com/gpu: 1
Strategy 2: GPU Sharing (MIG / MPS)
NVIDIA Multi-Instance GPU (MIG) partitions one A100 into up to seven hardware-isolated slices:
A100 80GB with MIG:
├── Slice 1: 2g.20gb → Small model API
├── Slice 2: 2g.20gb → Embedding service
├── Slice 3: 2g.20gb → Image classifier
└── Slice 4: 1g.10gb → Development/testing
MIG setup:
# Enable MIG mode on GPU 0 (requires a GPU reset — drain workloads first)
sudo nvidia-smi -i 0 -mig 1
# Create three 2g.20gb slices and one 1g.10gb slice (profile IDs 14 and 19)
sudo nvidia-smi mig -i 0 -cgi 14,14,14,19 -C
# Verify the created MIG devices
nvidia-smi -L
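Once MIG is enabled and the device plugin runs with its mixed MIG strategy, a pod can request a specific slice instead of a whole GPU — the resource name follows the plugin's nvidia.com/mig-&lt;profile&gt; convention:

```yaml
resources:
  limits:
    nvidia.com/mig-2g.20gb: 1  # one 20GB MIG slice, not a full A100
```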
Strategy 3: Time-Sharing (NVIDIA MPS)
Multiple processes share one GPU by time-slicing:
# Enable MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log
nvidia-cuda-mps-control -d
When to use which:
| Strategy | Best For | Isolation | Efficiency |
|---|---|---|---|
| One-per-GPU | Production inference | Full | Good |
| MIG | Multiple small models on A100 | Hardware-level | Excellent |
| MPS | Dev/test environments | Process-level | Good |
| No sharing | Training (needs full GPU) | Full | Varies |
Node Affinity & Taints
Labeling GPU Nodes
# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4
kubectl label nodes cpu-node-1 role=cpu-only
Scheduling Models to Specific GPUs
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
This ensures your 70B model only runs on A100 nodes, not T4 nodes where it would OOM.
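For a single exact-match label like this, the shorter nodeSelector field expresses the same constraint:

```yaml
spec:
  nodeSelector:
    gpu-type: a100
```

Full nodeAffinity is only needed for operators like In with multiple values, NotIn, or Exists.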
Taints: Reserving GPU Nodes
# Taint GPU nodes — only GPU workloads can use them
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# CPU-only pods won't be scheduled on GPU nodes
# GPU pods need a toleration:
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
Monitoring GPU Utilization
DCGM Exporter + Prometheus
# Add the NVIDIA repo, then deploy the DCGM exporter for GPU metrics
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
# Key metrics to watch:
# DCGM_FI_DEV_GPU_UTIL — GPU utilization %
# DCGM_FI_DEV_FB_USED — VRAM used (bytes)
# DCGM_FI_DEV_GPU_TEMP — Temperature
# DCGM_FI_DEV_POWER_USAGE — Power draw (watts)
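As a sketch, these metrics can drive a Prometheus alert that flags idle, paid-for GPUs; the threshold and windows below are illustrative, not recommendations:

```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUUnderutilized
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} under 20% utilization for 1h+ — idle spend"
```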
Quick CLI Monitoring
# Inside a GPU pod
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 5
# Output every 5 seconds:
# utilization.gpu [%], memory.used [MiB], memory.total [MiB], temperature.gpu
# 78 %, 12400 MiB, 24576 MiB, 72
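The CSV output is easy to post-process in a monitoring script; a minimal parser, with the sample row above hard-coded for illustration:

```python
def parse_gpu_csv(line):
    """Parse one data row of nvidia-smi --format=csv output into numbers."""
    util, used, total, temp = [field.strip() for field in line.split(",")]
    return {
        "util_pct": int(util.rstrip(" %")),       # "78 %"      -> 78
        "mem_used_mib": int(used.rstrip(" MiB")),  # "12400 MiB" -> 12400
        "mem_total_mib": int(total.rstrip(" MiB")),
        "temp_c": int(temp),
    }

stats = parse_gpu_csv("78 %, 12400 MiB, 24576 MiB, 72")
mem_pct = 100 * stats["mem_used_mib"] / stats["mem_total_mib"]  # ~50% VRAM used
```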
Practice Lab
Task 1: GPU Node Setup
In your Kubernetes cluster, install the NVIDIA device plugin. Label your GPU nodes by GPU type. Verify that kubectl describe nodes shows GPU resources.
Task 2: Multi-Model Deployment
Deploy 2 different models on a multi-GPU node. Use resource limits to ensure each model gets exactly 1 GPU. Verify with nvidia-smi that both GPUs are allocated.
Task 3: Monitoring Dashboard
Set up DCGM exporter and query GPU metrics. Create a simple dashboard showing GPU utilization, VRAM usage, and temperature for each GPU in your cluster.
Pakistan Case Study
Meet Asim — infrastructure lead at a Karachi fintech running 3 AI models: fraud detection, credit scoring, and Urdu NLP chatbot.
His problem: 2 A100 GPUs (PKR 200,000/month cloud bill). All 3 models competing for GPU time. Fraud detection (latency-critical) was getting delayed by the chatbot (batch-tolerant).
His solution:
- GPU 0: Fraud detection (dedicated, tainted — nothing else runs here)
- GPU 1 with MIG: Credit scoring (40GB slice) + Chatbot (40GB slice)
- Node affinity ensures fraud model never shares GPU
- Priority classes: fraud > credit > chatbot
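The priority ordering can be expressed with Kubernetes PriorityClass objects; the names and values here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: fraud-critical
value: 1000000        # highest — latency-critical fraud detection
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-tolerant
value: 1000           # lowest — the chatbot yields under pressure
```

Each pod then references its class via priorityClassName in its spec.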
Results:
- Fraud detection latency: 200ms → 45ms (dedicated GPU)
- All 3 models running simultaneously (was previously sequential)
- GPU utilization: 35% → 82% (stopped paying for idle compute)
- Monthly savings: PKR 60,000 (removed the 3rd GPU they thought they needed)
Key Takeaways
- GPUs are the most expensive resource — scheduling them well saves serious money
- NVIDIA device plugin makes GPUs visible to Kubernetes scheduler
- GPU requests and limits must be equal — no fractional or overcommitted GPUs
- MIG splits A100s into isolated slices for multi-model serving
- Node affinity and taints ensure the right model runs on the right GPU
- Monitor GPU utilization with DCGM — target 70-85% for cost efficiency
Next lesson: FastAPI for model serving — building production AI endpoints.
Quiz: GPU Scheduling & Resource Management
4 questions to test your understanding. Score 60% or higher to pass.