From 30% to 85%: Optimizing GPU Utilization in Kubernetes
Introduction: The GPU Utilization Crisis
Organizations are spending millions on GPU infrastructure, yet most achieve only 20-40% utilization. With A100 GPUs costing $10,000+ and H100s reaching $30,000+, this inefficiency translates to massive waste. In one recent engagement, I discovered a client was effectively burning $2.3M annually on idle GPU cycles.
The root causes are multifaceted: poor scheduling, inefficient batching, memory fragmentation, and lack of visibility into actual usage. After optimizing GPU utilization for dozens of ML platforms, I’ve developed systematic approaches that consistently achieve 85%+ utilization without sacrificing performance.
This article shares the exact techniques, monitoring strategies, and architectural patterns I use to maximize GPU efficiency in production Kubernetes environments.
Understanding GPU Utilization Metrics
What We’re Actually Measuring
GPU utilization isn’t a single metric. Effective optimization requires understanding multiple dimensions:
# GPU metrics collection script
import pynvml
import time
import json

def collect_gpu_metrics():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    metrics = []
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Compute utilization
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # Memory usage
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # Power usage
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        # Temperature
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        # Process info
        processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        metrics.append({
            'gpu_id': i,
            'compute_utilization': util.gpu,
            'memory_utilization': util.memory,
            'memory_used_gb': mem_info.used / 1024**3,
            'memory_total_gb': mem_info.total / 1024**3,
            'power_watts': power,
            'temperature_c': temp,
            'process_count': len(processes)
        })
    return metrics
Baseline Assessment
Before optimization, I establish comprehensive baselines:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-monitoring-config
data:
  prometheus-config.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'dcgm'
        static_configs:
          - targets: ['dcgm-exporter:9400']
      - job_name: 'custom-gpu-metrics'
        static_configs:
          - targets: ['gpu-metrics-collector:8080']
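With DCGM metrics flowing into Prometheus, a few queries turn the baseline into concrete numbers. A minimal sketch, assuming the standard dcgm-exporter metric names; the Hostname label and the 30% threshold are assumptions you should adjust to your deployment:

# Fleet-wide average compute utilization over the past week
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d]))

# Per-node average, to spot chronically idle pools
avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d]))

# Share of GPU-time spent below 30% utilization
avg(avg_over_time((DCGM_FI_DEV_GPU_UTIL < bool 30)[7d:5m]))

These three numbers (fleet average, per-node spread, and idle share) are the figures to track before and after every change in the sections that follow.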
Multi-Instance GPU (MIG) Configuration
When MIG Makes Sense
MIG is transformative for inference workloads but requires careful planning:
#!/bin/bash
# MIG configuration script for A100

# Enable MIG mode (takes effect after a GPU reset)
nvidia-smi -i 0 -mig 1

# Create two GPU instances (3g.20gb profile, ID 9) and their default compute instances
nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify configuration
nvidia-smi -L
nvidia-smi mig -lgi
Dynamic MIG Profiles
I’ve developed a controller that adjusts MIG profiles based on workload:
// MIG profile controller
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type MIGController struct {
	client.Client
}

func (m *MIGController) OptimizeMIGProfiles(ctx context.Context) error {
	// Get all GPU nodes
	nodes := &corev1.NodeList{}
	if err := m.List(ctx, nodes, client.MatchingLabels{"node.kubernetes.io/gpu": "true"}); err != nil {
		return err
	}
	for _, node := range nodes.Items {
		// analyzeWorkloads classifies the node's current pods (implementation omitted)
		workloadType := m.analyzeWorkloads(ctx, node.Name)
		var err error
		switch workloadType {
		case "inference-heavy":
			// Configure for many small instances
			err = m.configureMIG(node.Name, "1g.5gb", 7)
		case "training-heavy":
			// Configure for fewer large instances
			err = m.configureMIG(node.Name, "3g.20gb", 2)
		case "mixed":
			// Balanced configuration
			err = m.configureMIG(node.Name, "2g.10gb", 3)
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func (m *MIGController) configureMIG(nodeName, profile string, count int) error {
	// SSH to the node, destroy existing instances (they must be idle), and recreate
	// the requested layout. nvidia-smi mig -cgi expects one profile entry per instance.
	profileList := strings.TrimSuffix(strings.Repeat(profile+",", count), ",")
	cmd := exec.Command("ssh", nodeName, fmt.Sprintf(
		"nvidia-smi mig -dci && nvidia-smi mig -dgi && nvidia-smi mig -cgi %s -C",
		profileList,
	))
	return cmd.Run()
}
MIG-Aware Scheduling
Ensuring pods land on appropriate MIG instances:
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: model-server
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
GPU Time-Slicing Strategies
Configuring Time-Slicing
For development and light inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
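With replicas: 4, the device plugin advertises each physical GPU as four nvidia.com/gpu resources, so a development pod requests a slice the same way it would request a full GPU. A minimal sketch; the image is a placeholder, and remember that time-sliced replicas share GPU memory with no isolation between them:

apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook
  annotations:
    workload.gpu/type: notebook   # picked up by the scheduler extension below
spec:
  containers:
  - name: notebook
    image: jupyter-pytorch:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1           # one time-slice replica, not a dedicated GPU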
Intelligent Workload Placement
My custom scheduler extension for time-sliced GPUs:
// Time-slicing aware scheduler
func (s *GPUScheduler) Filter(ctx context.Context, pod *v1.Pod, node *v1.Node) bool {
	// Check if pod is suitable for time-slicing
	if !s.isTimeSliceCandidate(pod) {
		return s.dedicatedGPUFilter(pod, node)
	}
	// Check current time-slice allocation
	currentSlices := s.getUsedTimeSlices(node)
	maxSlices := s.getMaxTimeSlices(node)
	if currentSlices >= maxSlices {
		return false
	}
	// Check for workload compatibility
	existingWorkloads := s.getNodeWorkloads(node)
	for _, workload := range existingWorkloads {
		if !s.areWorkloadsCompatible(pod, workload) {
			return false
		}
	}
	return true
}

func (s *GPUScheduler) isTimeSliceCandidate(pod *v1.Pod) bool {
	// Notebooks, development, light inference
	annotations := pod.GetAnnotations()
	workloadType := annotations["workload.gpu/type"]
	return workloadType == "notebook" ||
		workloadType == "development" ||
		workloadType == "light-inference"
}
Batch Optimization Techniques
Dynamic Batching for Inference
Implementing intelligent batching significantly improves throughput:
# Dynamic batching service
import asyncio
from typing import List, Dict, Any
import numpy as np
import time

class DynamicBatcher:
    def __init__(self,
                 max_batch_size: int = 32,
                 max_latency_ms: int = 50,
                 min_batch_size: int = 1):
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.min_batch_size = min_batch_size
        self.pending_requests = []
        self.lock = asyncio.Lock()

    async def add_request(self, request: Dict[str, Any]) -> Any:
        """Add request to batch and wait for result"""
        future = asyncio.Future()
        async with self.lock:
            self.pending_requests.append({
                'request': request,
                'future': future,
                'timestamp': time.time()
            })
            # Check if we should process immediately
            if len(self.pending_requests) >= self.max_batch_size:
                await self._process_batch()
        return await future

    async def _process_batch(self):
        """Process accumulated requests as a batch"""
        if not self.pending_requests:
            return
        # Extract requests
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]
        # Combine inputs
        inputs = np.stack([r['request']['input'] for r in batch])
        # Run inference
        outputs = await self._run_inference(inputs)
        # Distribute results
        for i, item in enumerate(batch):
            item['future'].set_result(outputs[i])

    async def _batch_timeout_handler(self):
        """Process batches based on timeout"""
        while True:
            await asyncio.sleep(self.max_latency_ms / 1000.0)
            async with self.lock:
                now = time.time()
                ready_requests = [
                    r for r in self.pending_requests
                    if (now - r['timestamp']) * 1000 >= self.max_latency_ms
                ]
                if ready_requests:
                    await self._process_batch()
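The timeout handler above runs as a background task, so it has to be started explicitly when the service boots; otherwise partial batches only flush once max_batch_size is reached. A usage sketch, where run_model, preprocess, and raw_request are illustrative placeholders for the model-specific pieces:

# Usage sketch: run_model(), preprocess(), and raw_request are illustrative placeholders
async def main():
    batcher = DynamicBatcher(max_batch_size=32, max_latency_ms=50)
    batcher._run_inference = run_model                      # model-specific inference coroutine
    asyncio.create_task(batcher._batch_timeout_handler())   # start the background flusher

    # Inside each request handler: submit one item, await its individual result
    output = await batcher.add_request({'input': preprocess(raw_request)})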
Batch Size Optimization
Finding optimal batch sizes for different models:
import time
import torch

def find_optimal_batch_size(model, input_shape, gpu_memory_gb=16):
    """Determine optimal batch size for model"""
    batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    results = []
    for batch_size in batch_sizes:
        try:
            # Test batch
            torch.cuda.reset_peak_memory_stats()
            dummy_input = torch.randn(batch_size, *input_shape).cuda()
            # Measure throughput
            start = time.time()
            for _ in range(100):
                with torch.no_grad():
                    _ = model(dummy_input)
            torch.cuda.synchronize()
            throughput = (100 * batch_size) / (time.time() - start)
            # Measure memory
            memory_used = torch.cuda.max_memory_allocated() / 1024**3
            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
                'memory_gb': memory_used,
                'efficiency': throughput / memory_used
            })
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise
    # Find best efficiency
    best = max(results, key=lambda x: x['efficiency'])
    return best['batch_size']
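I run this search offline per model and feed the result into the serving configuration (for example, the max_batch_size of the DynamicBatcher above). A quick sketch using a torchvision ResNet-50 as a stand-in model:

import torchvision

# Illustrative: profile a ResNet-50 and use the result in the serving config
model = torchvision.models.resnet50().cuda().eval()
best = find_optimal_batch_size(model, input_shape=(3, 224, 224))
print(f"Optimal batch size: {best}")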
Memory Management Optimization
GPU Memory Fragmentation Prevention
# Memory pool management
import torch
import gc

class GPUMemoryManager:
    def __init__(self, reserved_memory_gb=2):
        self.reserved_memory_gb = reserved_memory_gb
        self.memory_pool = {}

    def allocate_tensor(self, shape, dtype=torch.float32):
        """Allocate tensor from pool if possible"""
        key = (tuple(shape), dtype)
        if key in self.memory_pool and self.memory_pool[key]:
            # Reuse existing tensor
            tensor = self.memory_pool[key].pop()
            tensor.zero_()
            return tensor
        else:
            # Allocate new tensor
            return torch.zeros(shape, dtype=dtype, device='cuda')

    def release_tensor(self, tensor):
        """Return tensor to pool"""
        key = (tuple(tensor.shape), tensor.dtype)
        if key not in self.memory_pool:
            self.memory_pool[key] = []
        self.memory_pool[key].append(tensor)

    def cleanup(self):
        """Periodic cleanup of unused tensors"""
        for key in list(self.memory_pool.keys()):
            if len(self.memory_pool[key]) > 10:
                # Keep only 10 tensors of each size
                self.memory_pool[key] = self.memory_pool[key][:10]
        # Force garbage collection
        gc.collect()
        torch.cuda.empty_cache()
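In practice the manager wraps the hot allocation path of a serving loop, so steady-state traffic reuses the same buffers instead of repeatedly hitting the CUDA caching allocator. A brief usage sketch with illustrative shapes:

manager = GPUMemoryManager()

# Inside the request loop: borrow a staging buffer, use it, return it
batch_input = manager.allocate_tensor((32, 3, 224, 224))
# ... copy the incoming batch into batch_input and run the model ...
manager.release_tensor(batch_input)

# Periodically (for example between traffic peaks), trim the pool
manager.cleanup()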
Memory-Aware Pod Scheduling
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: memory-efficient
value: 500
preemptionPolicy: PreemptLowerPriority
description: "For memory-optimized GPU workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: efficient-training
  annotations:
    scheduler.alpha.kubernetes.io/preferred-gpu-memory: "24Gi"
spec:
  priorityClassName: memory-efficient
  containers:
  - name: training
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
        memory: 32Gi
Workload-Specific Optimization
Inference Optimization
For inference workloads, I implement several strategies:
# TensorRT optimization for inference
import tensorrt as trt
import pycuda.driver as cuda

class TensorRTOptimizer:
    def __init__(self, onnx_model_path, precision='fp16'):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.builder = trt.Builder(self.logger)
        self.config = self.builder.create_builder_config()
        # Set precision
        if precision == 'fp16':
            self.config.set_flag(trt.BuilderFlag.FP16)
        elif precision == 'int8':
            # INT8 additionally requires a calibrator or per-tensor dynamic ranges
            self.config.set_flag(trt.BuilderFlag.INT8)
        # Set workspace memory limit (newer TensorRT releases use
        # config.set_memory_pool_limit instead)
        self.config.max_workspace_size = 1 << 30  # 1GB

    def optimize_model(self, onnx_model_path):
        """Convert ONNX model to TensorRT"""
        network = self.builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, self.logger)
        # Parse ONNX model
        with open(onnx_model_path, 'rb') as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))
        # Build optimized engine
        plan = self.builder.build_serialized_network(network, self.config)
        engine = trt.Runtime(self.logger).deserialize_cuda_engine(plan)
        return engine
Training Optimization
For training workloads, different strategies apply:
# Gradient accumulation for large batch training
class GradientAccumulationTrainer:
    def __init__(self, model, optimizer, criterion, accumulation_steps=4):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.accumulation_steps = accumulation_steps

    def train_step(self, dataloader):
        self.model.train()
        accumulated_loss = 0
        for i, (inputs, targets) in enumerate(dataloader):
            # Forward pass
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)
            # Normalize loss by accumulation steps
            loss = loss / self.accumulation_steps
            loss.backward()
            accumulated_loss += loss.item()
            # Update weights
            if (i + 1) % self.accumulation_steps == 0:
                self.optimizer.step()
                self.optimizer.zero_grad()
        return accumulated_loss
Advanced Scheduling Strategies
Gang Scheduling for Distributed Training
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4
  queue: default
  priorityClassName: high-priority
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: distributed-training
    spec:
      schedulerName: volcano
      restartPolicy: Never
      containers:
      - name: trainer
        image: pytorch-training:latest
        resources:
          limits:
            nvidia.com/gpu: 2
Bin Packing Optimizer
// Custom bin packing scheduler
type BinPackingScheduler struct {
	client client.Client
}

func (b *BinPackingScheduler) Score(ctx context.Context, pod *v1.Pod, node *v1.Node) (int64, error) {
	// Calculate current GPU utilization
	nodeMetrics := b.getNodeGPUMetrics(node.Name)
	// Calculate how well this pod would pack.
	// getGPURequirement returns the pod's GPU request as a fraction of node capacity.
	requiredGPU := b.getGPURequirement(pod)
	currentUtilization := nodeMetrics.GPUUtilization
	newUtilization := currentUtilization + requiredGPU
	// Score higher for better packing
	if newUtilization > 0.7 && newUtilization < 0.95 {
		return 100, nil // Optimal packing
	} else if newUtilization < 0.7 {
		return 50, nil // Under-utilized
	} else {
		return 10, nil // Over-subscribed risk
	}
}
Monitoring and Alerting
Comprehensive GPU Dashboard
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "GPU Utilization Deep Dive",
        "panels": [
          {
            "title": "GPU Compute Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [{
              "expr": "avg by (gpu, node) (DCGM_FI_DEV_GPU_UTIL)"
            }]
          },
          {
            "title": "GPU Memory Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [{
              "expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100"
            }]
          },
          {
            "title": "Underutilized GPUs",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_GPU_UTIL < 30"
            }]
          },
          {
            "title": "GPU Power Efficiency",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_GPU_UTIL"
            }]
          }
        ]
      }
    }
Utilization Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
  - name: gpu.rules
    interval: 30s
    rules:
    - alert: GPUUnderutilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
      for: 10m
      annotations:
        summary: "GPU {{ $labels.gpu }} underutilized"
        description: "GPU utilization below 30% for 10 minutes"
    - alert: GPUMemoryLeak
      expr: deriv(DCGM_FI_DEV_FB_USED[1h]) > 0.1
      for: 30m
      annotations:
        summary: "Potential GPU memory leak"
        description: "GPU memory usage increasing continuously"
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      annotations:
        summary: "GPU temperature critical"
        description: "GPU {{ $labels.gpu }} temperature above 85°C"
Case Study: From 30% to 85% Utilization
Initial State
A machine learning platform with:
- 50 A100 GPUs across 10 nodes
- Average utilization: 30%
- Monthly cost: $125,000
- Primary workload: Model training and inference
Optimization Steps
- Week 1: Baseline and Monitoring
  - Deployed comprehensive monitoring
  - Identified that 60% of GPU time was idle between batches
  - Found 15 GPUs dedicated to development but rarely used
- Week 2: MIG Implementation
  - Configured MIG on 20 A100s for inference
  - Created 7x 1g.5gb instances per GPU
  - Moved inference workloads to MIG instances
- Week 3: Time-Slicing for Development
  - Implemented 4:1 time-slicing on 10 GPUs
  - Migrated development workloads
  - Freed up 5 full GPUs
- Week 4: Batch Optimization
  - Implemented dynamic batching
  - Optimized batch sizes per model
  - Reduced inference latency by 40%
- Week 5: Workload Scheduling
  - Implemented gang scheduling for distributed training
  - Added bin packing optimizer
  - Consolidated workloads to fewer nodes
Results
- Utilization: Increased from 30% to 85%
- Effective capacity: Equivalent to adding 35 A100s
- Cost savings: $87,500/month
- Performance: 25% improvement in training throughput
Lessons Learned
- MIG is transformative for inference workloads
- Time-slicing works well for development but not production
- Batch optimization provides immediate gains
- Visibility is essential - you can’t optimize what you can’t measure
- Gradual rollout minimizes disruption
Best Practices Summary
1. Start with Measurement
# Essential metrics to track
- GPU compute utilization (target: >80%)
- GPU memory utilization (target: >70%)
- Power efficiency (performance per watt)
- Queue depth (pending GPU requests)
- Time to allocation (scheduling latency)
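To make these targets actionable, it helps to materialize them as Prometheus recording rules so dashboards and alerts share one definition. A sketch covering the first three metrics, assuming the DCGM field names used earlier; the utilization-per-watt ratio is only a proxy for performance per watt:

groups:
- name: gpu-efficiency.rules
  rules:
  - record: gpu:compute_utilization:avg5m
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
  - record: gpu:memory_utilization:percent
    expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
  - record: gpu:power_efficiency:util_per_watt
    expr: DCGM_FI_DEV_GPU_UTIL / DCGM_FI_DEV_POWER_USAGE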
2. Match Strategy to Workload
- Inference: MIG + Dynamic batching
- Training: Gang scheduling + Gradient accumulation
- Development: Time-slicing + Preemptible priority
- Mixed: Dedicated pools with overflow
3. Implement Gradually
- Phase 1: Monitoring and baseline
- Phase 2: Easy wins (batch optimization)
- Phase 3: MIG for appropriate workloads
- Phase 4: Advanced scheduling
- Phase 5: Continuous optimization
4. Automate Everything
- Use operators for configuration management
- Implement auto-scaling based on metrics (a minimal sketch follows this list)
- Automated rebalancing during low-usage periods
- Self-healing for common issues
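For the auto-scaling item above, scaling inference Deployments on GPU utilization rather than CPU keeps added capacity aligned with actual demand. A minimal sketch using KEDA's Prometheus scaler; the Deployment name, Prometheus address, and the pod label on the DCGM metric are assumptions that depend on how dcgm-exporter is configured in your cluster:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-gpu-scaler
spec:
  scaleTargetRef:
    name: model-server            # hypothetical inference Deployment
  minReplicaCount: 2
  maxReplicaCount: 16
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # adjust to your Prometheus
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"model-server.*"})
      threshold: "75"             # add replicas when average GPU utilization exceeds 75%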
Conclusion
Achieving 85%+ GPU utilization requires a multi-faceted approach combining hardware features (MIG), software optimization (batching), and intelligent scheduling. The techniques I’ve outlined here have consistently delivered 2-3x improvements in GPU efficiency across diverse ML workloads.
The key is to start with comprehensive monitoring, understand your workload patterns, and apply the appropriate optimization strategies. With GPU costs continuing to rise, the ROI on utilization optimization often exceeds any other infrastructure investment.
Remember: every percentage point of improved GPU utilization directly impacts your bottom line. In the era of $30,000 H100s, the difference between 30% and 85% utilization is the difference between competitive advantage and bankruptcy.
Want to optimize your GPU infrastructure? I provide consulting services to help organizations maximize their GPU investments. Contact me to discuss your utilization challenges.