AI Infrastructure & Local LLMs · Module 6

6.3 GPU Scheduling & Resource Management

25 min · 14 code blocks · Practice Lab · Quiz (4 questions)

GPU Scheduling & Resource Management

GPUs are the most expensive resource in any AI infrastructure: an NVIDIA A100 costs roughly $2-3 per hour in the cloud. If your Kubernetes cluster has 4 GPUs and 10 AI models competing for them, who gets what? Poor scheduling means GPUs sit idle while pods queue, or one greedy model starves everything else. This lesson teaches you to manage GPU resources efficiently in containerized environments.

The GPU Scheduling Problem

code
Without GPU scheduling:
├── Model A grabs GPU 0 and GPU 1 (needs only 1)
├── Model B grabs GPU 2 (correct)
├── Model C → no GPU available → waiting...
├── Model D → no GPU available → waiting...
└── GPU 1 is 90% idle (wasted $2/hour)

With proper scheduling:
├── Model A → GPU 0 (limit: 1 GPU)
├── Model B → GPU 1 (limit: 1 GPU)
├── Model C → GPU 2 (limit: 1 GPU)
├── Model D → GPU 3 (limit: 1 GPU)
└── All GPUs utilized, all models served
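
Kubernetes can also enforce fairness before scheduling even starts. As a guard against one team's greedy model requesting every GPU, a ResourceQuota can cap GPU requests per namespace. A minimal sketch, assuming a hypothetical namespace called model-serving (the nvidia.com/gpu resource name comes from the device plugin installed below):

yaml
# Cap total GPU requests in one namespace (namespace name is a placeholder)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: model-serving
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # at most 2 GPUs requested across all pods in this namespace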

NVIDIA Device Plugin for Kubernetes

The NVIDIA device plugin lets Kubernetes see and schedule GPUs:

bash
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPUs are detected
kubectl describe nodes | grep nvidia.com/gpu
# Output: nvidia.com/gpu: 4  (shows 4 GPUs available)

Requesting GPUs in Pod Specs

yaml
spec:
  containers:
    - name: llm-inference
      image: my-llm-api:v1
      resources:
        requests:
          nvidia.com/gpu: 1    # Request 1 GPU
        limits:
          nvidia.com/gpu: 1    # Limit to 1 GPU

Rules:

  • GPU requests and limits must be equal (no overcommit)
  • GPUs are whole numbers only (can't request 0.5 GPU)
  • A pod with nvidia.com/gpu: 1 gets exclusive access to one full GPU
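
The spec above is only the containers fragment. For reference, a complete Deployment wrapping it might look like the sketch below; the image tag and labels are placeholders carried over from the fragment, not a real registry image.

yaml
# Full Deployment around the fragment above (image name is a placeholder)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm-inference
          image: my-llm-api:v1
          resources:
            requests:
              nvidia.com/gpu: 1    # request and limit must match
            limits:
              nvidia.com/gpu: 1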

GPU Resource Planning

Sizing Your Models

Model Size | VRAM Needed | GPU Required | PKR Cloud Cost/hour
7B (Q4) | 4-6 GB | RTX 3060 / T4 | PKR 150-250
7B (FP16) | 14 GB | RTX 4090 / A10G | PKR 300-500
13B (Q4) | 8-10 GB | RTX 4070 / T4 | PKR 200-350
70B (Q4) | 36-40 GB | A100 40GB | PKR 700-1,000
70B (FP16) | 140 GB | 2x A100 80GB | PKR 1,500-2,000

The VRAM Budget

code
Total GPU VRAM: 24 GB (RTX 4090)
├── Model weights: 14 GB (7B FP16)
├── KV cache: 4 GB (for context window)
├── Activation memory: 2 GB (during inference)
├── OS/driver overhead: 1 GB
└── Available: 3 GB buffer

Rule: Keep 10-15% VRAM free as buffer

Multi-GPU Strategies

Strategy 1: One Model Per GPU

Simplest approach — each model gets its own GPU:

yaml
# Model A: one dedicated GPU, scheduled onto an A100 node
spec:
  nodeSelector:
    gpu-type: "a100"
  containers:
    - name: model-a
      resources:
        limits:
          nvidia.com/gpu: 1
---
# Model B: one dedicated GPU
spec:
  containers:
    - name: model-b
      resources:
        limits:
          nvidia.com/gpu: 1

Strategy 2: GPU Sharing (MIG / MPS)

NVIDIA Multi-Instance GPU (MIG) splits one A100 into smaller slices:

code
A100 80GB with MIG:
├── Slice 1: 20GB → Small model API
├── Slice 2: 20GB → Embedding service
├── Slice 3: 20GB → Image classifier
└── Slice 4: 20GB → Development/testing

MIG setup:

bash
# Enable MIG mode on GPU 0 (the GPU must be idle; some platforms need a reset afterwards)
sudo nvidia-smi -i 0 -mig 1

# Create four 20 GB slices (1g.20gb is the 4-way profile on an A100 80GB;
# list the profiles your GPU supports with: nvidia-smi mig -lgip)
sudo nvidia-smi mig -i 0 -cgi 1g.20gb,1g.20gb,1g.20gb,1g.20gb -C

# Verify the created GPU instances
nvidia-smi mig -lgi
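
To point a pod at one of these slices instead of a whole GPU, the device plugin, when configured with its "mixed" MIG strategy, advertises each profile as its own resource name. A minimal sketch, assuming the 1g.20gb profile created above and a placeholder image:

yaml
# Request one MIG slice rather than a full GPU (assumes the mixed MIG strategy)
spec:
  containers:
    - name: embedding-service
      image: my-embedding-api:v1       # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.20gb: 1    # one 20 GB MIG slice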

Strategy 3: Process Sharing (NVIDIA MPS)

NVIDIA's Multi-Process Service (MPS) lets several processes submit work to one GPU at the same time, instead of each one getting exclusive access:

bash
# Enable MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log
nvidia-cuda-mps-control -d
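
MPS shares the GPU by letting processes run concurrently, not by strict time-slicing. If you want time-slicing managed by Kubernetes itself, newer releases of the NVIDIA device plugin (v0.12+) accept a sharing section in their config. The ConfigMap below is a sketch; the name, namespace, and replica count are examples, and how the plugin picks up this config depends on how it was installed.

yaml
# Device plugin time-slicing config (names and replica count are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU is advertised as 4 schedulable GPUs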

When to use which:

Strategy | Best For | Isolation | Efficiency
One-per-GPU | Production inference | Full | Good
MIG | Multiple small models on A100 | Hardware-level | Excellent
MPS | Dev/test environments | Process-level | Good
No sharing | Training (needs full GPU) | Full | Varies

Node Affinity & Taints

Labeling GPU Nodes

bash
# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
kubectl label nodes gpu-node-2 gpu-type=t4
kubectl label nodes cpu-node-1 role=cpu-only

Scheduling Models to Specific GPUs

yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type
                operator: In
                values:
                  - a100

This ensures your 70B model only runs on A100 nodes, not T4 nodes where it would OOM.

Taints: Reserving GPU Nodes

bash
# Taint GPU nodes — only GPU workloads can use them
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# CPU-only pods won't be scheduled on GPU nodes
# GPU pods need a toleration:
yaml
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"

Monitoring GPU Utilization

DCGM Exporter + Prometheus

bash
# Add NVIDIA's dcgm-exporter chart repo, then deploy the exporter for GPU metrics
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

# Key metrics to watch:
# DCGM_FI_DEV_GPU_UTIL — GPU utilization %
# DCGM_FI_DEV_FB_USED — VRAM used (MiB)
# DCGM_FI_DEV_GPU_TEMP — Temperature
# DCGM_FI_DEV_POWER_USAGE — Power draw (watts)
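
With these metrics in Prometheus you can alert on waste, for example an expensive GPU that sits mostly idle. The rule below is a sketch assuming the Prometheus Operator's PrometheusRule CRD is installed; the name, labels, and threshold are illustrative, and your Prometheus setup may require extra labels for the rule to be discovered.

yaml
# Example alert: a GPU averaging under 10% utilization for 30 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuMostlyIdle
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU under 10% utilization for 30 minutes; consider rescheduling or resizing"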

Quick CLI Monitoring

bash
# Inside a GPU pod
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 5

# Output every 5 seconds:
# utilization.gpu [%], memory.used [MiB], memory.total [MiB], temperature.gpu
# 78 %, 12400 MiB, 24576 MiB, 72

Practice Lab

Task 1: GPU Node Setup. In your Kubernetes cluster, install the NVIDIA device plugin. Label your GPU nodes by GPU type. Verify that kubectl describe nodes shows GPU resources.

Task 2: Multi-Model Deployment. Deploy 2 different models on a multi-GPU node. Use resource limits to ensure each model gets exactly 1 GPU. Verify with nvidia-smi that both GPUs are allocated.

Task 3: Monitoring Dashboard. Set up DCGM exporter and query GPU metrics. Create a simple dashboard showing GPU utilization, VRAM usage, and temperature for each GPU in your cluster.

Pakistan Case Study

Meet Asim, infrastructure lead at a Karachi fintech running 3 AI models: fraud detection, credit scoring, and an Urdu NLP chatbot.

His problem: 2 A100 GPUs (PKR 200,000/month cloud bill), all 3 models competing for GPU time. Fraud detection (latency-critical) was getting delayed by the chatbot (batch-tolerant).

His solution:

  • GPU 0: Fraud detection (dedicated, tainted — nothing else runs here)
  • GPU 1 with MIG: Credit scoring (40GB slice) + Chatbot (40GB slice)
  • Node affinity ensures fraud model never shares GPU
  • Priority classes: fraud > credit > chatbot
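
Priority classes like these are ordinary Kubernetes objects. A sketch of the top one follows; the name and value are illustrative, not taken from Asim's actual cluster.

yaml
# Highest-priority class for the latency-critical fraud model (name and value are examples)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: fraud-critical
value: 1000000
globalDefault: false
description: "Fraud-detection pods preempt lower-priority GPU workloads"
---
# Referenced from the fraud-detection pod spec
spec:
  priorityClassName: fraud-critical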

Results:

  • Fraud detection latency: 200ms → 45ms (dedicated GPU)
  • All 3 models running simultaneously (was previously sequential)
  • GPU utilization: 35% → 82% (stopped paying for idle compute)
  • Monthly savings: PKR 60,000 (removed the 3rd GPU they thought they needed)

Key Takeaways

  • GPUs are the most expensive resource — scheduling them well saves serious money
  • NVIDIA device plugin makes GPUs visible to Kubernetes scheduler
  • GPU requests and limits must be equal — no fractional or overcommitted GPUs
  • MIG splits A100s into isolated slices for multi-model serving
  • Node affinity and taints ensure the right model runs on the right GPU
  • Monitor GPU utilization with DCGM — target 70-85% for cost efficiency

Next lesson: FastAPI for model serving — building production AI endpoints.

Lesson Summary

Includes a hands-on practice lab, 14 runnable code examples, and a 4-question knowledge check below.

Quiz: GPU Scheduling & Resource Management

4 questions to test your understanding. Score 60% or higher to pass.