Organizations are spending millions on GPU infrastructure, yet most achieve only 20-40% utilization. With A100 GPUs costing $10,000+ and H100s reaching $30,000+, this inefficiency translates to massive waste. In one recent engagement, I discovered a client was effectively burning $2.3M annually on idle GPU cycles.
The root causes are multifaceted: poor scheduling, inefficient batching, memory fragmentation, and lack of visibility into actual usage. After optimizing GPU utilization for dozens of ML platforms, I’ve developed systematic approaches that consistently achieve 85%+ utilization without sacrificing performance.
This article shares the exact techniques, monitoring strategies, and architectural patterns I use to maximize GPU efficiency in production Kubernetes environments.
GPU utilization isn’t a single metric. Effective optimization requires understanding multiple dimensions:
# GPU metrics collection script
import pynvml
import time
import json
def collect_gpu_metrics():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    metrics = []
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        # Compute utilization
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)

        # Memory usage
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

        # Power usage
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

        # Temperature
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

        # Process info
        processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)

        metrics.append({
            'gpu_id': i,
            'compute_utilization': util.gpu,
            'memory_utilization': util.memory,
            'memory_used_gb': mem_info.used / 1024**3,
            'memory_total_gb': mem_info.total / 1024**3,
            'power_watts': power,
            'temperature_c': temp,
            'process_count': len(processes)
        })

    return metrics
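To sanity-check the collector on a node, I run it as a simple polling loop. Here is a minimal sketch of mine (assuming you just want newline-delimited JSON you can redirect to a file during the baseline period; the 15-second interval mirrors the Prometheus scrape interval configured below):

if __name__ == "__main__":
    # Emit one JSON line per poll; redirect stdout to a file to build a baseline.
    while True:
        print(json.dumps({"timestamp": time.time(), "gpus": collect_gpu_metrics()}), flush=True)
        time.sleep(15)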
Before optimization, I establish comprehensive baselines:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-monitoring-config
data:
  prometheus-config.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'dcgm'
        static_configs:
          - targets: ['dcgm-exporter:9400']
      - job_name: 'custom-gpu-metrics'
        static_configs:
          - targets: ['gpu-metrics-collector:8080']
MIG is transformative for inference workloads but requires careful planning:
#!/bin/bash
# MIG configuration script for A100
# Enable MIG mode
nvidia-smi -i 0 -mig 1
# Create GPU instances (3g.20gb profile)
nvidia-smi mig -i 0 -cgi 9,9 -C   # -C also creates the default compute instances
# Verify configuration
nvidia-smi -L
nvidia-smi mig -lgi
I’ve developed a controller that adjusts MIG profiles based on workload:
// MIG profile controller
package main

import (
    "context"
    "fmt"
    "os/exec"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type MIGController struct {
    client.Client
}

func (m *MIGController) OptimizeMIGProfiles(ctx context.Context) error {
    // Get all GPU nodes
    nodes := &corev1.NodeList{}
    if err := m.List(ctx, nodes, client.MatchingLabels{"node.kubernetes.io/gpu": "true"}); err != nil {
        return err
    }

    for _, node := range nodes.Items {
        workloadType := m.analyzeWorkloads(ctx, node.Name)

        switch workloadType {
        case "inference-heavy":
            // Configure for many small instances
            m.configureMIG(node.Name, "1g.5gb", 7)
        case "training-heavy":
            // Configure for fewer large instances
            m.configureMIG(node.Name, "3g.20gb", 2)
        case "mixed":
            // Balanced configuration
            m.configureMIG(node.Name, "2g.10gb", 3)
        }
    }
    return nil
}

func (m *MIGController) configureMIG(nodeName, profile string, count int) error {
    // SSH to node and reconfigure MIG
    cmd := exec.Command("ssh", nodeName, fmt.Sprintf(
        "nvidia-smi mig -dci && nvidia-smi mig -dgi && nvidia-smi mig -cgi %s,%d",
        profile, count,
    ))
    return cmd.Run()
}
Ensuring pods land on appropriate MIG instances:
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model-server
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
For development and light inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
          - name: nvidia.com/mig-1g.5gb
            replicas: 2
My custom scheduler extension for time-sliced GPUs:
// Time-slicing aware scheduler
func (s *GPUScheduler) Filter(ctx context.Context, pod *v1.Pod, node *v1.Node) bool {
    // Check if pod is suitable for time-slicing
    if !s.isTimeSliceCandidate(pod) {
        return s.dedicatedGPUFilter(pod, node)
    }

    // Check current time-slice allocation
    currentSlices := s.getUsedTimeSlices(node)
    maxSlices := s.getMaxTimeSlices(node)

    if currentSlices >= maxSlices {
        return false
    }

    // Check for workload compatibility
    existingWorkloads := s.getNodeWorkloads(node)
    for _, workload := range existingWorkloads {
        if !s.areWorkloadsCompatible(pod, workload) {
            return false
        }
    }
    return true
}

func (s *GPUScheduler) isTimeSliceCandidate(pod *v1.Pod) bool {
    // Notebooks, development, light inference
    annotations := pod.GetAnnotations()
    workloadType := annotations["workload.gpu/type"]

    return workloadType == "notebook" ||
        workloadType == "development" ||
        workloadType == "light-inference"
}
Implementing intelligent batching significantly improves throughput:
# Dynamic batching service
import asyncio
from typing import List, Dict, Any
import numpy as np
import time
class DynamicBatcher:
    def __init__(self,
                 max_batch_size: int = 32,
                 max_latency_ms: int = 50,
                 min_batch_size: int = 1):
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.min_batch_size = min_batch_size
        self.pending_requests = []
        self.lock = asyncio.Lock()

    async def add_request(self, request: Dict[str, Any]) -> Any:
        """Add request to batch and wait for result"""
        future = asyncio.Future()

        async with self.lock:
            self.pending_requests.append({
                'request': request,
                'future': future,
                'timestamp': time.time()
            })

            # Check if we should process immediately
            if len(self.pending_requests) >= self.max_batch_size:
                await self._process_batch()

        return await future

    async def _process_batch(self):
        """Process accumulated requests as a batch"""
        if not self.pending_requests:
            return

        # Extract requests
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]

        # Combine inputs
        inputs = np.stack([r['request']['input'] for r in batch])

        # Run inference
        outputs = await self._run_inference(inputs)

        # Distribute results
        for i, item in enumerate(batch):
            item['future'].set_result(outputs[i])

    async def _batch_timeout_handler(self):
        """Process batches based on timeout"""
        while True:
            await asyncio.sleep(self.max_latency_ms / 1000.0)

            async with self.lock:
                now = time.time()
                ready_requests = [
                    r for r in self.pending_requests
                    if (now - r['timestamp']) * 1000 >= self.max_latency_ms
                ]
                if ready_requests:
                    await self._process_batch()
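The batcher leaves _run_inference abstract and expects _batch_timeout_handler to run as a background task. Here is a minimal wiring sketch; EchoBatcher and its dummy inference step are placeholders of mine, not part of the production service:

class EchoBatcher(DynamicBatcher):
    async def _run_inference(self, inputs):
        # Placeholder for a real model call; returns one output row per request.
        return inputs * 2

async def main():
    batcher = EchoBatcher(max_batch_size=8, max_latency_ms=10)
    # The timeout handler must run in the background so partial batches still get flushed.
    flusher = asyncio.create_task(batcher._batch_timeout_handler())
    results = await asyncio.gather(
        *[batcher.add_request({'input': np.full((4,), i, dtype=np.float32)}) for i in range(20)]
    )
    print(len(results), results[0])
    flusher.cancel()

asyncio.run(main())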
Finding optimal batch sizes for different models:
import torch
import time

def find_optimal_batch_size(model, input_shape, gpu_memory_gb=16):
    """Determine optimal batch size for model"""
    batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    results = []

    for batch_size in batch_sizes:
        try:
            # Reset peak-memory stats so each batch size is measured independently
            torch.cuda.reset_peak_memory_stats()

            # Test batch
            dummy_input = torch.randn(batch_size, *input_shape).cuda()

            # Measure throughput
            start = time.time()
            for _ in range(100):
                with torch.no_grad():
                    _ = model(dummy_input)
            torch.cuda.synchronize()
            throughput = (100 * batch_size) / (time.time() - start)

            # Measure memory
            memory_used = torch.cuda.max_memory_allocated() / 1024**3

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
                'memory_gb': memory_used,
                'efficiency': throughput / memory_used
            })
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise

    # Find best efficiency (throughput per GB of memory)
    best = max(results, key=lambda x: x['efficiency'])
    return best['batch_size']
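As a usage example (the model and input shape here are mine, purely illustrative), sizing a ResNet-50 for 224x224 inputs:

import torchvision

model = torchvision.models.resnet50().cuda().eval()
best_batch = find_optimal_batch_size(model, input_shape=(3, 224, 224))
print(f"Selected batch size: {best_batch}")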
# Memory pool management
import torch
import gc
class GPUMemoryManager:
    def __init__(self, reserved_memory_gb=2):
        self.reserved_memory_gb = reserved_memory_gb
        self.memory_pool = {}

    def allocate_tensor(self, shape, dtype=torch.float32):
        """Allocate tensor from pool if possible"""
        key = (shape, dtype)

        if key in self.memory_pool and self.memory_pool[key]:
            # Reuse existing tensor
            tensor = self.memory_pool[key].pop()
            tensor.zero_()
            return tensor
        else:
            # Allocate new tensor
            return torch.zeros(shape, dtype=dtype, device='cuda')

    def release_tensor(self, tensor):
        """Return tensor to pool"""
        key = (tuple(tensor.shape), tensor.dtype)

        if key not in self.memory_pool:
            self.memory_pool[key] = []
        self.memory_pool[key].append(tensor)

    def cleanup(self):
        """Periodic cleanup of unused tensors"""
        for key in list(self.memory_pool.keys()):
            if len(self.memory_pool[key]) > 10:
                # Keep only 10 tensors of each size
                self.memory_pool[key] = self.memory_pool[key][:10]

        # Force garbage collection
        gc.collect()
        torch.cuda.empty_cache()
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: memory-efficient
value: 500
preemptionPolicy: PreemptLowerPriority
description: "For memory-optimized GPU workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: efficient-training
  annotations:
    scheduler.alpha.kubernetes.io/preferred-gpu-memory: "24Gi"
spec:
  priorityClassName: memory-efficient
  containers:
    - name: training
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: 32Gi
For inference workloads, I implement several strategies:
# TensorRT optimization for inference
import tensorrt as trt
import pycuda.driver as cuda

class TensorRTOptimizer:
    def __init__(self, onnx_model_path, precision='fp16'):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.builder = trt.Builder(self.logger)
        self.config = self.builder.create_builder_config()

        # Set precision
        if precision == 'fp16':
            self.config.set_flag(trt.BuilderFlag.FP16)
        elif precision == 'int8':
            self.config.set_flag(trt.BuilderFlag.INT8)

        # Set memory pool limit
        self.config.max_workspace_size = 1 << 30  # 1GB

    def optimize_model(self, onnx_model_path):
        """Convert ONNX model to TensorRT"""
        network = self.builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, self.logger)

        # Parse ONNX model
        with open(onnx_model_path, 'rb') as f:
            parser.parse(f.read())

        # Build optimized engine
        plan = self.builder.build_serialized_network(network, self.config)
        engine = trt.Runtime(self.logger).deserialize_cuda_engine(plan)

        return engine
For training workloads, different strategies apply:
# Gradient accumulation for large batch training
class GradientAccumulationTrainer:
    def __init__(self, model, optimizer, criterion, accumulation_steps=4):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.accumulation_steps = accumulation_steps

    def train_step(self, dataloader):
        self.model.train()
        accumulated_loss = 0

        for i, (inputs, targets) in enumerate(dataloader):
            # Forward pass
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Normalize loss by accumulation steps
            loss = loss / self.accumulation_steps
            loss.backward()
            accumulated_loss += loss.item()

            # Update weights every accumulation_steps mini-batches
            if (i + 1) % self.accumulation_steps == 0:
                self.optimizer.step()
                self.optimizer.zero_grad()

        return accumulated_loss
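A quick wiring sketch with synthetic data (the model, optimizer, and loss below are illustrative, not from a real training job): a per-step batch of 16 with accumulation_steps=4 gives the optimizer an effective batch size of 64, while only 16 samples ever occupy GPU memory at once.

import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
trainer = GradientAccumulationTrainer(model, optimizer,
                                      criterion=nn.CrossEntropyLoss(),
                                      accumulation_steps=4)

# Any iterable of (inputs, targets) pairs works as the dataloader here.
fake_loader = [(torch.randn(16, 128).cuda(), torch.randint(0, 10, (16,)).cuda())
               for _ in range(8)]
print("accumulated loss:", trainer.train_step(fake_loader))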
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4
  queue: default
  priorityClassName: high-priority
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: distributed-training
    spec:
      schedulerName: volcano
      containers:
        - name: trainer
          image: pytorch-training:latest
          resources:
            limits:
              nvidia.com/gpu: 2
// Custom bin packing scheduler
type BinPackingScheduler struct {
    client.Client
}

func (b *BinPackingScheduler) Score(ctx context.Context, pod *v1.Pod, node *v1.Node) (int64, error) {
    // Calculate current GPU utilization
    nodeMetrics := b.getNodeGPUMetrics(node.Name)

    // Calculate how well this pod would pack
    requiredGPU := b.getGPURequirement(pod)
    currentUtilization := nodeMetrics.GPUUtilization
    newUtilization := currentUtilization + requiredGPU

    // Score higher for better packing
    if newUtilization > 0.7 && newUtilization < 0.95 {
        return 100, nil // Optimal packing
    } else if newUtilization < 0.7 {
        return 50, nil // Under-utilized
    } else {
        return 10, nil // Over-subscribed risk
    }
}
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "GPU Utilization Deep Dive",
        "panels": [
          {
            "title": "GPU Compute Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [{
              "expr": "avg by (gpu, node) (DCGM_FI_DEV_GPU_UTIL)"
            }]
          },
          {
            "title": "GPU Memory Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [{
              "expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100"
            }]
          },
          {
            "title": "Underutilized GPUs",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_GPU_UTIL < 30"
            }]
          },
          {
            "title": "GPU Power Efficiency",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_GPU_UTIL"
            }]
          }
        ]
      }
    }
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
    - name: gpu.rules
      interval: 30s
      rules:
        - alert: GPUUnderutilized
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
          for: 10m
          annotations:
            summary: "GPU {{ $labels.gpu }} underutilized"
            description: "GPU utilization below 30% for 10 minutes"
        - alert: GPUMemoryLeak
          expr: rate(DCGM_FI_DEV_FB_USED[1h]) > 0.1
          for: 30m
          annotations:
            summary: "Potential GPU memory leak"
            description: "GPU memory usage increasing continuously"
        - alert: GPUTemperatureHigh
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          annotations:
            summary: "GPU temperature critical"
            description: "GPU {{ $labels.gpu }} temperature above 85°C"
A machine learning platform with:
- 50 A100 GPUs across 10 nodes
- Average utilization: 30%
- Monthly cost: $125,000
- Primary workload: Model training and inference
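Before changing anything, I quantify the gap. A rough back-of-the-envelope calculation (assuming the $125,000/month covers the full fleet and that useful work scales linearly with utilization) shows what is at stake:

# Back-of-the-envelope cost model for the cluster above.
gpus = 50
monthly_cost = 125_000        # USD for the whole fleet
current_util = 0.30
target_util = 0.85

useful_gpu_equivalents = gpus * current_util                    # ~15 GPUs of real work
gpus_needed_at_target = useful_gpu_equivalents / target_util    # ~17.6 GPUs
cost_per_gpu = monthly_cost / gpus                              # $2,500 per GPU per month
potential_monthly_cost = gpus_needed_at_target * cost_per_gpu   # ~$44,000
print(f"Potential monthly savings: ${monthly_cost - potential_monthly_cost:,.0f}")
# -> roughly $81,000/month, close to $1M/year, for the same amount of useful work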
# Essential metrics to track
- GPU compute utilization (target: >80%)
- GPU memory utilization (target: >70%)
- Power efficiency (performance per watt)
- Queue depth (pending GPU requests)
- Time to allocation (scheduling latency)
Achieving 85%+ GPU utilization requires a multi-faceted approach combining hardware features (MIG), software optimization (batching), and intelligent scheduling. The techniques I’ve outlined here have consistently delivered 2-3x improvements in GPU efficiency across diverse ML workloads.
The key is to start with comprehensive monitoring, understand your workload patterns, and apply the appropriate optimization strategies. With GPU costs continuing to rise, the ROI on utilization optimization often exceeds that of any other infrastructure investment.
Remember: every percentage point of improved GPU utilization directly impacts your bottom line. In the era of $30,000 H100s, the difference between 30% and 85% utilization is the difference between competitive advantage and bankruptcy.
Want to optimize your GPU infrastructure? I provide consulting services to help organizations maximize their GPU investments. Contact me to discuss your utilization challenges.