Organizations are spending millions on GPU infrastructure, yet most achieve only 20-40% utilization. With A100 GPUs costing $10,000+ and H100s reaching $30,000+, this inefficiency translates to massive waste. In one recent engagement, I discovered a client was effectively burning $2.3M annually on idle GPU cycles.
The root causes are multifaceted: poor scheduling, inefficient batching, memory fragmentation, and lack of visibility into actual usage. After optimizing GPU utilization for dozens of ML platforms, I’ve developed systematic approaches that consistently achieve 85%+ utilization without sacrificing performance.
This article shares the exact techniques, monitoring strategies, and architectural patterns I use to maximize GPU efficiency in production Kubernetes environments.
GPU utilization isn’t a single metric. Effective optimization requires understanding multiple dimensions:

# GPU metrics collection script
import pynvml


def collect_gpu_metrics():
    """Return one metrics dict per visible GPU."""
    pynvml.nvmlInit()
    try:
        device_count = pynvml.nvmlDeviceGetCount()
        metrics = []
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # Compute utilization: util.gpu is the % of time a kernel was running,
            # util.memory is the % of time device memory was being read or written
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            # Memory usage
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            # Power usage (NVML reports milliwatts)
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            # Temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            # Process info
            processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            metrics.append({
                'gpu_id': i,
                'compute_utilization': util.gpu,
                'memory_utilization': util.memory,
                'memory_used_gb': mem_info.used / 1024**3,
                'memory_total_gb': mem_info.total / 1024**3,
                'power_watts': power,
                'temperature_c': temp,
                'process_count': len(processes)
            })
        return metrics
    finally:
        # Release NVML even if a query fails
        pynvml.nvmlShutdown()

Before optimization, I establish comprehensive baselines:
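A baseline is easiest to compare against later when it is reduced to a few numbers per GPU. The sketch below is one way to do that, assuming utilization samples (in percent) have already been gathered, for example by calling `collect_gpu_metrics()` on a timer. The name `summarize_baseline` and the 30% idle threshold are illustrative choices, not part of any library:

```python
from statistics import mean, median


def summarize_baseline(samples, idle_threshold=30.0):
    """Summarize GPU utilization samples (percent values, 0-100).

    idle_threshold is an assumed cutoff for counting a sample as underutilized.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank 95th percentile, using integer ceiling division
    p95_index = max(0, -(-95 * len(ordered) // 100) - 1)
    idle = sum(1 for s in ordered if s < idle_threshold)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "p95": ordered[p95_index],
        "idle_fraction": idle / len(ordered),
    }


# Hypothetical samples for one GPU, one reading per scrape interval
samples = [5, 10, 12, 20, 25, 35, 40, 60, 85, 90]
print(summarize_baseline(samples))
```

The idle fraction is often more telling than the mean, because a GPU that alternates between saturated and idle can show a respectable average. For continuous collection rather than ad hoc sampling, the Prometheus scrape configuration that follows pulls from both the DCGM exporter and the custom collector.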

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-monitoring-config
data:
  prometheus-config.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'dcgm'
        static_configs:
          - targets: ['dcgm-exporter:9400']
      - job_name: 'custom-gpu-metrics'
        static_configs:
          - targets: ['gpu-metrics-collector:8080']

MIG is transformative for inference workloads but requires careful planning:
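Part of that planning is checking that a desired mix of instances actually fits on one card. A minimal sketch, assuming the A100-40GB budget of 7 compute slices and 8 memory slices and the per-profile slice counts in the dictionary below. `fits_on_a100_40gb` is a hypothetical helper, and passing this check is necessary but not sufficient, because the driver also restricts where instances may be placed:

```python
# (compute slices, memory slices) consumed by each profile on an A100-40GB
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_SLICES = 7
MEMORY_SLICES = 8


def fits_on_a100_40gb(requested):
    """Return True if a {profile: count} mix stays within the slice budget."""
    compute = memory = 0
    for profile, count in requested.items():
        c, m = A100_40GB_PROFILES[profile]
        compute += c * count
        memory += m * count
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES


print(fits_on_a100_40gb({"3g.20gb": 2}))               # 6 compute, 8 memory slices
print(fits_on_a100_40gb({"3g.20gb": 2, "1g.5gb": 1}))  # 7 compute, 9 memory slices
```

The second mix fails on memory slices even though its compute slices fit, which is the kind of mistake that is cheaper to catch before draining a node. Once a layout passes the check, the script that follows enables MIG mode and creates the instances.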
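
#!/bin/bash
# MIG configuration script for A100
set -euo pipefail

# Enable MIG mode (requires root; the GPU must be idle and may need a reset)
nvidia-smi -i 0 -mig 1

# Create two GPU instances (profile ID 9 = 3g.20gb on an A100-40GB);
# -C also creates the default compute instance inside each one
nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify configuration
nvidia-smi -L
nvidia-smi mig -lgi

I’ve developed a controller that adjusts MIG profiles based on workload: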
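The controller depends on classifying each node's workload, and how that classification is done is not shown here. One possible heuristic, sketched below, labels a node by the share of its GPU requests that come from inference pods. The function name, the input shape, and the 70% threshold are all assumptions made for illustration:

```python
def classify_node_workload(gpu_requests, threshold=0.7):
    """Classify a node from (workload_kind, gpu_count) pairs.

    workload_kind is "inference" or "training"; threshold is an assumed
    cutoff for calling a node heavy in one direction.
    """
    totals = {"inference": 0, "training": 0}
    for kind, gpus in gpu_requests:
        if kind not in totals:
            raise ValueError(f"unknown workload kind: {kind!r}")
        totals[kind] += gpus
    total = sum(totals.values())
    if total == 0:
        return "mixed"  # nothing scheduled yet, so keep the balanced layout
    if totals["inference"] / total >= threshold:
        return "inference-heavy"
    if totals["training"] / total >= threshold:
        return "training-heavy"
    return "mixed"


print(classify_node_workload([("inference", 3), ("training", 1)]))
```

In practice the classification should be smoothed over time, since reconfiguring MIG requires the GPU to be idle and a node that flips labels every few minutes would never settle. The Go controller that follows maps the same three labels to MIG layouts.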

// MIG profile controller
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type MIGController struct {
	client.Client
}

func (m *MIGController) OptimizeMIGProfiles(ctx context.Context) error {
	// Get all GPU nodes
	nodes := &corev1.NodeList{}
	if err := m.List(ctx, nodes, client.MatchingLabels{"node.kubernetes.io/gpu": "true"}); err != nil {
		return err
	}

	for _, node := range nodes.Items {
		// analyzeWorkloads (not shown) classifies the pods running on the node
		workloadType := m.analyzeWorkloads(ctx, node.Name)

		var err error
		switch workloadType {
		case "inference-heavy":
			// Configure for many small instances
			err = m.configureMIG(ctx, node.Name, "1g.5gb", 7)
		case "training-heavy":
			// Configure for fewer large instances
			err = m.configureMIG(ctx, node.Name, "3g.20gb", 2)
		case "mixed":
			// Balanced configuration
			err = m.configureMIG(ctx, node.Name, "2g.10gb", 3)
		}
		if err != nil {
			return fmt.Errorf("configuring MIG on node %s: %w", node.Name, err)
		}
	}
	return nil
}

func (m *MIGController) configureMIG(ctx context.Context, nodeName, profile string, count int) error {
	// -cgi takes one comma-separated entry per instance, so repeat the profile name
	profiles := strings.TrimSuffix(strings.Repeat(profile+",", count), ",")

	// SSH to node and reconfigure MIG: destroy the existing compute and GPU
	// instances, then create the new ones with default compute instances (-C).
	// The node must be drained of GPU workloads first.
	cmd := exec.CommandContext(ctx, "ssh", nodeName, fmt.Sprintf(
		"nvidia-smi mig -dci && nvidia-smi mig -dgi && nvidia-smi mig -cgi %s -C",
		profiles,
	))
	return cmd.Run()
}

Ensuring pods land on appropriate MIG instances:
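
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model-server
      image: model-server:latest  # placeholder image name
      resources:
        limits:
          # Resource name exposed under the "mixed" MIG strategy
          nvidia.com/mig-1g.5gb: 1
  nodeSelector:
    # Label set by GPU feature discovery; the exact value depends on the
    # MIG strategy and component versions, so check the node's labels first
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb

For development and light inference workloads: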

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
          # Applies only on nodes that expose MIG devices as named
          # resources, which requires migStrategy: mixed
          - name: nvidia.com/mig-1g.5gb
            replicas: 2

My custom scheduler extension for time-sliced GPUs:

// Time-slicing aware scheduler
func (s *GPUScheduler) Filter(ctx context.Context, pod *v1.Pod, node *v1.Node) bool {
	// Check if pod is suitable for time-slicing
	if !s.isTimeSliceCandidate(pod) {
		return s.dedicatedGPUFilter(pod, node)
	}

	// Check current time-slice allocation
	currentSlices := s.getUsedTimeSlices(node)
	maxSlices := s.getMaxTimeSlices(node)
	if currentSlices >= maxSlices {
		return false
	}

	// Check for workload compatibility
	existingWorkloads := s.getNodeWorkloads(node)
	for _, workload := range existingWorkloads {
		if !s.areWorkloadsCompatible(pod, workload) {
			return false
		}
	}
	return true
}

func (s *GPUScheduler) isTimeSliceCandidate(pod *v1.Pod) bool {
	// Notebooks, development, light inference
	annotations := pod.GetAnnotations()
	workloadType := annotations["workload.gpu/type"]
	return workloadType == "notebook" ||
		workloadType == "development" ||
		workloadType == "light-inference"
}

Implementing intelligent batching significantly improves throughput:
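The reason is that every GPU call carries a fixed cost (kernel launches, host-to-device transfers) that batching spreads across requests. A toy cost model makes this concrete. The 5 ms overhead and 0.5 ms per-item figures are made-up numbers for illustration, not measurements:

```python
def batched_throughput(batch_size, overhead_ms=5.0, per_item_ms=0.5):
    """Requests per second when one call costs overhead_ms + batch_size * per_item_ms."""
    call_ms = overhead_ms + batch_size * per_item_ms
    return batch_size / call_ms * 1000.0


def batch_latency_ms(batch_size, overhead_ms=5.0, per_item_ms=0.5):
    """Time one batched call takes, which every request in the batch waits for."""
    return overhead_ms + batch_size * per_item_ms


for size in (1, 8, 32, 128):
    print(f"batch={size:>3}  {batched_throughput(size):7.1f} req/s  "
          f"{batch_latency_ms(size):5.1f} ms per call")
```

Under this model throughput approaches 1000 / per_item_ms as the batch grows, so the gain flattens while the latency of each call keeps rising. That trade-off is why the batcher that follows caps both the batch size and the time a request may wait.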

# Dynamic batching service
import asyncio
import time
from typing import Any, Dict

import numpy as np


class DynamicBatcher:
    def __init__(self,
                 max_batch_size: int = 32,
                 max_latency_ms: int = 50,
                 min_batch_size: int = 1):
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.min_batch_size = min_batch_size
        self.pending_requests = []
        self.lock = asyncio.Lock()

    async def add_request(self, request: Dict[str, Any]) -> Any:
        """Add request to batch and wait for result"""
        future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending_requests.append({
                'request': request,
                'future': future,
                'timestamp': time.monotonic()
            })
            # Check if we should process immediately
            if len(self.pending_requests) >= self.max_batch_size:
                await self._process_batch()
        return await future

    async def _process_batch(self):
        """Process accumulated requests as a batch (caller must hold self.lock)"""
        if not self.pending_requests:
            return
        # Extract requests
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]
        try:
            # Combine inputs
            inputs = np.stack([r['request']['input'] for r in batch])
            # Run inference
            outputs = await self._run_inference(inputs)
        except Exception as exc:
            # Fail every waiter instead of leaving its future pending forever
            for item in batch:
                item['future'].set_exception(exc)
            return
        # Distribute results
        for i, item in enumerate(batch):
            item['future'].set_result(outputs[i])

    async def _run_inference(self, inputs: np.ndarray):
        """Model-specific forward pass; supply this in a subclass"""
        raise NotImplementedError

    async def _batch_timeout_handler(self):
        """Process batches based on timeout; run it as a background task,
        e.g. asyncio.create_task(batcher._batch_timeout_handler())"""
        while True:
            # Poll more often than the latency budget so that a request
            # overshoots it by at most a quarter
            await asyncio.sleep(self.max_latency_ms / 4000.0)
            async with self.lock:
                now = time.monotonic()
                ready_requests = [
                    r for r in self.pending_requests
                    if (now - r['timestamp']) * 1000 >= self.max_latency_ms
                ]
                if ready_requests:
                    await self._process_batch()

Finding optimal batch sizes for different models:
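
import time

import torch


def find_optimal_batch_size(model, input_shape, gpu_memory_gb=16):
    """Determine optimal batch size for model"""
    batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    results = []
    for batch_size in batch_sizes:
        try:
            # Test batch
            dummy_input = torch.randn(batch_size, *input_shape).cuda()
            # Reset the peak counter so each batch size is measured on its own
            torch.cuda.reset_peak_memory_stats()
            with torch.no_grad():
                # Warm up so one-time CUDA initialization is not timed
                for _ in range(10):
                    _ = model(dummy_input)
                torch.cuda.synchronize()
                # Measure throughput
                start = time.perf_counter()
                for _ in range(100):
                    _ = model(dummy_input)
                torch.cuda.synchronize()
            throughput = (100 * batch_size) / (time.perf_counter() - start)
            # Measure memory
            memory_used = torch.cuda.max_memory_allocated() / 1024**3
            if memory_used > gpu_memory_gb:
                break  # over the memory budget; larger batches will be too
            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
                'memory_gb': memory_used,
                'efficiency': throughput / memory_used
            })
        except RuntimeError as e:
            if "out of memory" in str(e):
                break
            raise
    if not results:
        raise RuntimeError("no batch size fit within the GPU memory budget")
    # Find best efficiency (throughput per GB of GPU memory)
    best = max(results, key=lambda x: x['efficiency'])
    return best['batch_size']

# Memory pool management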
import torch
import gc


class GPUMemoryManager:
    def __init__(self, reserved_memory_gb=2):
        self.reserved_memory_gb = reserved_memory_gb
        self.memory_pool = {}

    def allocate_tensor(self, shape, dtype=torch.float32):
        """Allocate tensor from pool if possible"""
        # Normalize to a tuple so the key matches the one built in release_tensor
        key = (tuple(shape), dtype)
        if self.memory_pool.get(key):
            # Reuse existing tensor
            tensor = self.memory_pool[key].pop()
            tensor.zero_()
            return tensor
        else:
            # Allocate new tensor
            return torch.zeros(shape, dtype=dtype, device='cuda')

    def release_tensor(self, tensor):
        """Return tensor to pool"""
        key = (tuple(tensor.shape), tensor.dtype)
        if key not in self.memory_pool:
            self.memory_pool[key] = []
        self.memory_pool[key].append(tensor)

    def cleanup(self):
        """Periodic cleanup of unused tensors"""
        for key in list(self.memory_pool.keys()):
            if len(self.memory_pool[key]) > 10:
                # Keep only 10 tensors of each size
                self.memory_pool[key] = self.memory_pool[key][:10]
        # Force garbage collection, then return freed blocks to the driver
        gc.collect()
        torch.cuda.empty_cache()

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: memory-efficient
value: 500
preemptionPolicy: PreemptLowerPriority
description: "For memory-optimized GPU workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: efficient-training
  annotations:
    # Custom annotation: only a scheduler extension that reads it will act on it
    scheduler.alpha.kubernetes.io/preferred-gpu-memory: "24Gi"
spec:
  priorityClassName: memory-efficient
  containers:
    - name: training
      image: pytorch-training:latest  # placeholder image name
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: 32Gi

For inference workloads, I implement several strategies:

# TensorRT optimization for inference
import tensorrt as trt


class TensorRTOptimizer:
    def __init__(self, onnx_model_path, precision='fp16'):
        self.onnx_model_path = onnx_model_path
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.builder = trt.Builder(self.logger)
        self.config = self.builder.create_builder_config()
        # Set precision
        if precision == 'fp16':
            self.config.set_flag(trt.BuilderFlag.FP16)
        elif precision == 'int8':
            # INT8 also needs a calibrator or explicit dynamic ranges
            self.config.set_flag(trt.BuilderFlag.INT8)
        # Set memory pool limit (TensorRT 8.4+; replaces max_workspace_size)
        self.config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

    def optimize_model(self, onnx_model_path):
        """Convert ONNX model to TensorRT"""
        # Explicit-batch network, as required by the ONNX parser
        network = self.builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, self.logger)
        # Parse ONNX model and surface parser errors instead of ignoring them
        with open(onnx_model_path, 'rb') as f:
            if not parser.parse(f.read()):
                errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
                raise RuntimeError("ONNX parse failed: " + "; ".join(errors))
        # Build optimized engine
        plan = self.builder.build_serialized_network(network, self.config)
        if plan is None:
            raise RuntimeError("TensorRT engine build failed")
        engine = trt.Runtime(self.logger).deserialize_cuda_engine(plan)
        return engine

For training workloads, different strategies apply:
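A common one is gradient accumulation, which relies on the fact that, for equally sized micro-batches, averaging the per-micro-batch mean gradients reproduces the full-batch mean gradient. The sketch below checks this numerically for a one-parameter least-squares model in plain Python. The data and function names are illustrative:

```python
def mean_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, accumulation_steps):
    """Sum of micro-batch mean gradients, each scaled by 1 / accumulation_steps."""
    size = len(batch) // accumulation_steps
    if size * accumulation_steps != len(batch):
        raise ValueError("batch must split evenly into micro-batches")
    total = 0.0
    for step in range(accumulation_steps):
        micro = batch[step * size:(step + 1) * size]
        total += mean_gradient(w, micro) / accumulation_steps
    return total


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
full = mean_gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, accumulation_steps=4)
print(full, accumulated)  # equal up to floating-point rounding
```

This is why the trainer that follows divides each loss by `accumulation_steps` before calling `backward()`: without that scaling the accumulated gradient would be `accumulation_steps` times too large. Only one micro-batch of activations is resident at a time, so a large effective batch fits in the memory of a smaller one.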

# Gradient accumulation for large batch training
class GradientAccumulationTrainer:
    def __init__(self, model, optimizer, criterion, accumulation_steps=4):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion  # loss function, e.g. torch.nn.CrossEntropyLoss()
        self.accumulation_steps = accumulation_steps

    def train_step(self, dataloader):
        self.model.train()
        accumulated_loss = 0.0
        # Start from clean gradients
        self.optimizer.zero_grad()
        step = 0
        for step, (inputs, targets) in enumerate(dataloader, start=1):
            # Forward pass
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)
            # Normalize loss by accumulation steps
            loss = loss / self.accumulation_steps
            loss.backward()
            accumulated_loss += loss.item()
            # Update weights
            if step % self.accumulation_steps == 0:
                self.optimizer.step()
                self.optimizer.zero_grad()
        # Apply gradients left over from a final, partial accumulation window
        if step % self.accumulation_steps != 0:
            self.optimizer.step()
            self.optimizer.zero_grad()
        return accumulated_loss

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4
  queue: default
  priorityClassName: high-priority
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: distributed-training
    spec:
      schedulerName: volcano
      restartPolicy: Never  # a Job's pod template must use Never or OnFailure
      containers:
        - name: trainer
          image: pytorch-training:latest
          resources:
            limits:
              nvidia.com/gpu: 2

// Custom bin packing scheduler
type BinPackingScheduler struct {
	client client.Client
}

func (b *BinPackingScheduler) Score(ctx context.Context, pod *v1.Pod, node *v1.Node) (int64, error) {
	// Calculate current GPU utilization
	nodeMetrics := b.getNodeGPUMetrics(node.Name)

	// Calculate how well this pod would pack
	requiredGPU := b.getGPURequirement(pod)
	currentUtilization := nodeMetrics.GPUUtilization
	newUtilization := currentUtilization + requiredGPU

	// Score higher for better packing; the bounds are inclusive so that
	// exactly 0.7 is not scored as over-subscribed
	if newUtilization >= 0.7 && newUtilization <= 0.95 {
		return 100, nil // Optimal packing
	} else if newUtilization < 0.7 {
		return 50, nil // Under-utilized
	} else {
		return 10, nil // Over-subscribed risk
	}
}

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "GPU Utilization Deep Dive",
        "panels": [
          {
            "title": "GPU Compute Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [{
              "expr": "avg by (gpu, node) (DCGM_FI_DEV_GPU_UTIL)"
            }]
          },
          {
            "title": "GPU Memory Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [{
              "expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100"
            }]
          },
          {
            "title": "Underutilized GPUs",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_GPU_UTIL < 30"
            }]
          },
          {
            "title": "GPU Power Efficiency",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_POWER_USAGE / (DCGM_FI_DEV_GPU_UTIL > 0)"
            }]
          }
        ]
      }
    }

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
    - name: gpu.rules
      interval: 30s
      rules:
        - alert: GPUUnderutilized
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
          for: 10m
          annotations:
            summary: "GPU {{ $labels.gpu }} underutilized"
            description: "GPU utilization below 30% for 10 minutes"
        - alert: GPUMemoryLeak
          # FB_USED is a gauge (MiB), so use deriv() rather than rate();
          # 0.1 MiB/s is roughly 360 MiB of growth per hour
          expr: deriv(DCGM_FI_DEV_FB_USED[1h]) > 0.1
          for: 30m
          annotations:
            summary: "Potential GPU memory leak"
            description: "GPU memory usage increasing continuously"
        - alert: GPUTemperatureHigh
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          annotations:
            summary: "GPU temperature critical"
            description: "GPU {{ $labels.gpu }} temperature above 85°C"

A machine learning platform with:
- 50 A100 GPUs across 10 nodes
- Average utilization: 30%
- Monthly cost: $125,000
- Primary workload: Model training and inference
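To put those figures in perspective, the arithmetic below estimates the idle spend and the fleet size that would deliver the same work at the 85% target. It assumes cost is spread evenly across GPUs and that useful work scales linearly with utilization, both of which are simplifications. The function name is illustrative:

```python
import math


def utilization_economics(gpus, monthly_cost, current_util, target_util):
    """Estimate idle spend and the GPU count needed for the same work at target_util."""
    idle_spend = monthly_cost * (1 - current_util)
    # Same useful GPU-hours, delivered at the higher utilization
    gpus_needed = math.ceil(gpus * current_util / target_util)
    cost_at_target = monthly_cost * gpus_needed / gpus
    return {
        "idle_spend_per_month": idle_spend,
        "gpus_needed_at_target": gpus_needed,
        "monthly_cost_at_target": cost_at_target,
    }


print(utilization_economics(gpus=50, monthly_cost=125_000,
                            current_util=0.30, target_util=0.85))
```

Under these assumptions roughly $87,500 of the $125,000 monthly bill pays for idle capacity, and about 18 GPUs running at 85% would match the work that 50 deliver at 30%.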
# Essential metrics to track
- GPU compute utilization (target: >80%)
- GPU memory utilization (target: >70%)
- Power efficiency (performance per watt)
- Queue depth (pending GPU requests)
- Time to allocation (scheduling latency)

Achieving 85%+ GPU utilization requires a multifaceted approach combining hardware features (MIG), software optimization (batching), and intelligent scheduling. The techniques I’ve outlined here have consistently delivered 2-3x improvements in GPU efficiency across diverse ML workloads.
The key is to start with comprehensive monitoring, understand your workload patterns, and apply the appropriate optimization strategies. With GPU costs continuing to rise, the ROI on utilization optimization often exceeds any other infrastructure investment.
Remember: every percentage point of improved GPU utilization directly impacts your bottom line. In the era of $30,000 H100s, the difference between 30% and 85% utilization is the difference between competitive advantage and bankruptcy.