Organizations are spending millions on GPU infrastructure, yet most achieve only 20-40% utilization. With A100 GPUs costing $10,000+ and H100s reaching $30,000+, this inefficiency translates to massive waste. In one recent engagement, I discovered a client was effectively burning $2.3M annually on idle GPU cycles.
The root causes are multifaceted: poor scheduling, inefficient batching, memory fragmentation, and lack of visibility into actual usage. After optimizing GPU utilization for dozens of ML platforms, I’ve developed systematic approaches that consistently achieve 85%+ utilization without sacrificing performance.
This article shares the exact techniques, monitoring strategies, and architectural patterns I use to maximize GPU efficiency in production Kubernetes environments.
GPU utilization isn’t a single metric. Effective optimization requires understanding multiple dimensions:
# GPU metrics collection script
import pynvml
import time
import json
def collect_gpu_metrics():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()

    metrics = []
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        # Compute utilization
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)

        # Memory usage
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

        # Power usage
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

        # Temperature
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

        # Process info
        processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)

        metrics.append({
            'gpu_id': i,
            'compute_utilization': util.gpu,
            'memory_utilization': util.memory,
            'memory_used_gb': mem_info.used / 1024**3,
            'memory_total_gb': mem_info.total / 1024**3,
            'power_watts': power,
            'temperature_c': temp,
            'process_count': len(processes)
        })

    return metrics
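To sanity-check the collector on a node, I run it as a simple polling loop. Here is a minimal sketch of mine (assuming you just want newline-delimited JSON you can redirect to a file during the baseline period; the 15-second interval mirrors the Prometheus scrape interval configured below):

if __name__ == "__main__":
    # Emit one JSON line per poll; redirect stdout to a file to build a baseline.
    while True:
        print(json.dumps({"timestamp": time.time(), "gpus": collect_gpu_metrics()}), flush=True)
        time.sleep(15)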
Before optimization, I establish comprehensive baselines:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-monitoring-config
data:
  prometheus-config.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'dcgm'
        static_configs:
          - targets: ['dcgm-exporter:9400']
      - job_name: 'custom-gpu-metrics'
        static_configs:
          - targets: ['gpu-metrics-collector:8080']
MIG is transformative for inference workloads but requires careful planning:
#!/bin/bash
# MIG configuration script for A100
# Enable MIG mode
nvidia-smi -i 0 -mig 1
# Create GPU instances (3g.20gb profile)
nvidia-smi mig -i 0 -cgi 9,9 -C   # -C also creates the default compute instances
# Verify configuration
nvidia-smi -L
nvidia-smi mig -lgi
I’ve developed a controller that adjusts MIG profiles based on workload:
// MIG profile controller
package main

import (
    "context"
    "fmt"
    "os/exec"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type MIGController struct {
    client.Client
}

func (m *MIGController) OptimizeMIGProfiles(ctx context.Context) error {
    // Get all GPU nodes
    nodes := &corev1.NodeList{}
    if err := m.List(ctx, nodes, client.MatchingLabels{"node.kubernetes.io/gpu": "true"}); err != nil {
        return err
    }

    for _, node := range nodes.Items {
        workloadType := m.analyzeWorkloads(ctx, node.Name)

        switch workloadType {
        case "inference-heavy":
            // Configure for many small instances
            m.configureMIG(node.Name, "1g.5gb", 7)
        case "training-heavy":
            // Configure for fewer large instances
            m.configureMIG(node.Name, "3g.20gb", 2)
        case "mixed":
            // Balanced configuration
            m.configureMIG(node.Name, "2g.10gb", 3)
        }
    }
    return nil
}

func (m *MIGController) configureMIG(nodeName, profile string, count int) error {
    // SSH to node and reconfigure MIG
    cmd := exec.Command("ssh", nodeName, fmt.Sprintf(
        "nvidia-smi mig -dci && nvidia-smi mig -dgi && nvidia-smi mig -cgi %s,%d",
        profile, count,
    ))
    return cmd.Run()
}
Ensuring pods land on appropriate MIG instances:
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model-server
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
For development and light inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
          - name: nvidia.com/mig-1g.5gb
            replicas: 2
My custom scheduler extension for time-sliced GPUs:
// Time-slicing aware scheduler
func (s *GPUScheduler) Filter(ctx context.Context, pod *v1.Pod, node *v1.Node) bool {
    // Check if pod is suitable for time-slicing
    if !s.isTimeSliceCandidate(pod) {
        return s.dedicatedGPUFilter(pod, node)
    }

    // Check current time-slice allocation
    currentSlices := s.getUsedTimeSlices(node)
    maxSlices := s.getMaxTimeSlices(node)

    if currentSlices >= maxSlices {
        return false
    }

    // Check for workload compatibility
    existingWorkloads := s.getNodeWorkloads(node)
    for _, workload := range existingWorkloads {
        if !s.areWorkloadsCompatible(pod, workload) {
            return false
        }
    }
    return true
}

func (s *GPUScheduler) isTimeSliceCandidate(pod *v1.Pod) bool {
    // Notebooks, development, light inference
    annotations := pod.GetAnnotations()
    workloadType := annotations["workload.gpu/type"]

    return workloadType == "notebook" ||
        workloadType == "development" ||
        workloadType == "light-inference"
}
Implementing intelligent batching significantly improves throughput:
# Dynamic batching service
import asyncio
from typing import List, Dict, Any
import numpy as np
import time
class DynamicBatcher:
    def __init__(self,
                 max_batch_size: int = 32,
                 max_latency_ms: int = 50,
                 min_batch_size: int = 1):
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.min_batch_size = min_batch_size
        self.pending_requests = []
        self.lock = asyncio.Lock()

    async def add_request(self, request: Dict[str, Any]) -> Any:
        """Add request to batch and wait for result"""
        future = asyncio.Future()

        async with self.lock:
            self.pending_requests.append({
                'request': request,
                'future': future,
                'timestamp': time.time()
            })

            # Check if we should process immediately
            if len(self.pending_requests) >= self.max_batch_size:
                await self._process_batch()

        return await future

    async def _process_batch(self):
        """Process accumulated requests as a batch"""
        if not self.pending_requests:
            return

        # Extract requests
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]

        # Combine inputs
        inputs = np.stack([r['request']['input'] for r in batch])

        # Run inference
        outputs = await self._run_inference(inputs)

        # Distribute results
        for i, item in enumerate(batch):
            item['future'].set_result(outputs[i])

    async def _batch_timeout_handler(self):
        """Process batches based on timeout"""
        while True:
            await asyncio.sleep(self.max_latency_ms / 1000.0)

            async with self.lock:
                now = time.time()
                ready_requests = [
                    r for r in self.pending_requests
                    if (now - r['timestamp']) * 1000 >= self.max_latency_ms
                ]
                if ready_requests:
                    await self._process_batch()
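The batcher leaves _run_inference abstract and expects _batch_timeout_handler to run as a background task. Here is a minimal wiring sketch; EchoBatcher and its dummy inference step are placeholders of mine, not part of the production service:

class EchoBatcher(DynamicBatcher):
    async def _run_inference(self, inputs):
        # Placeholder for a real model call; returns one output row per request.
        return inputs * 2

async def main():
    batcher = EchoBatcher(max_batch_size=8, max_latency_ms=10)
    # The timeout handler must run in the background so partial batches still get flushed.
    flusher = asyncio.create_task(batcher._batch_timeout_handler())
    results = await asyncio.gather(
        *[batcher.add_request({'input': np.full((4,), i, dtype=np.float32)}) for i in range(20)]
    )
    print(len(results), results[0])
    flusher.cancel()

asyncio.run(main())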
Finding optimal batch sizes for different models:
import torch
import time

def find_optimal_batch_size(model, input_shape, gpu_memory_gb=16):
    """Determine optimal batch size for model"""
    batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    results = []

    for batch_size in batch_sizes:
        try:
            # Reset peak-memory stats so each batch size is measured independently
            torch.cuda.reset_peak_memory_stats()

            # Test batch
            dummy_input = torch.randn(batch_size, *input_shape).cuda()

            # Measure throughput
            start = time.time()
            for _ in range(100):
                with torch.no_grad():
                    _ = model(dummy_input)
            torch.cuda.synchronize()
            throughput = (100 * batch_size) / (time.time() - start)

            # Measure memory
            memory_used = torch.cuda.max_memory_allocated() / 1024**3

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
                'memory_gb': memory_used,
                'efficiency': throughput / memory_used
            })
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise

    # Find best efficiency (throughput per GB of memory)
    best = max(results, key=lambda x: x['efficiency'])
    return best['batch_size']
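As a usage example (the model and input shape here are mine, purely illustrative), sizing a ResNet-50 for 224x224 inputs:

import torchvision

model = torchvision.models.resnet50().cuda().eval()
best_batch = find_optimal_batch_size(model, input_shape=(3, 224, 224))
print(f"Selected batch size: {best_batch}")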
# Memory pool management
import torch
import gc
class GPUMemoryManager:
    def __init__(self, reserved_memory_gb=2):
        self.reserved_memory_gb = reserved_memory_gb
        self.memory_pool = {}

    def allocate_tensor(self, shape, dtype=torch.float32):
        """Allocate tensor from pool if possible"""
        key = (shape, dtype)

        if key in self.memory_pool and self.memory_pool[key]:
            # Reuse existing tensor
            tensor = self.memory_pool[key].pop()
            tensor.zero_()
            return tensor
        else:
            # Allocate new tensor
            return torch.zeros(shape, dtype=dtype, device='cuda')

    def release_tensor(self, tensor):
        """Return tensor to pool"""
        key = (tuple(tensor.shape), tensor.dtype)

        if key not in self.memory_pool:
            self.memory_pool[key] = []
        self.memory_pool[key].append(tensor)

    def cleanup(self):
        """Periodic cleanup of unused tensors"""
        for key in list(self.memory_pool.keys()):
            if len(self.memory_pool[key]) > 10:
                # Keep only 10 tensors of each size
                self.memory_pool[key] = self.memory_pool[key][:10]

        # Force garbage collection
        gc.collect()
        torch.cuda.empty_cache()
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: memory-efficient
value: 500
preemptionPolicy: PreemptLowerPriority
description: "For memory-optimized GPU workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: efficient-training
  annotations:
    scheduler.alpha.kubernetes.io/preferred-gpu-memory: "24Gi"
spec:
  priorityClassName: memory-efficient
  containers:
    - name: training
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: 32Gi
For inference workloads, I implement several strategies:
# TensorRT optimization for inference
import tensorrt as trt
import pycuda.driver as cuda

class TensorRTOptimizer:
    def __init__(self, onnx_model_path, precision='fp16'):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.builder = trt.Builder(self.logger)
        self.config = self.builder.create_builder_config()

        # Set precision
        if precision == 'fp16':
            self.config.set_flag(trt.BuilderFlag.FP16)
        elif precision == 'int8':
            self.config.set_flag(trt.BuilderFlag.INT8)

        # Set memory pool limit
        self.config.max_workspace_size = 1 << 30  # 1GB

    def optimize_model(self, onnx_model_path):
        """Convert ONNX model to TensorRT"""
        network = self.builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, self.logger)

        # Parse ONNX model
        with open(onnx_model_path, 'rb') as f:
            parser.parse(f.read())

        # Build optimized engine
        plan = self.builder.build_serialized_network(network, self.config)
        engine = trt.Runtime(self.logger).deserialize_cuda_engine(plan)

        return engine
For training workloads, different strategies apply:
# Gradient accumulation for large batch training
class GradientAccumulationTrainer:
    def __init__(self, model, optimizer, criterion, accumulation_steps=4):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.accumulation_steps = accumulation_steps

    def train_step(self, dataloader):
        self.model.train()
        accumulated_loss = 0

        for i, (inputs, targets) in enumerate(dataloader):
            # Forward pass
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Normalize loss by accumulation steps
            loss = loss / self.accumulation_steps
            loss.backward()
            accumulated_loss += loss.item()

            # Update weights every accumulation_steps mini-batches
            if (i + 1) % self.accumulation_steps == 0:
                self.optimizer.step()
                self.optimizer.zero_grad()

        return accumulated_loss
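A quick wiring sketch with synthetic data (the model, optimizer, and loss below are illustrative, not from a real training job): a per-step batch of 16 with accumulation_steps=4 gives the optimizer an effective batch size of 64, while only 16 samples ever occupy GPU memory at once.

import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
trainer = GradientAccumulationTrainer(model, optimizer,
                                      criterion=nn.CrossEntropyLoss(),
                                      accumulation_steps=4)

# Any iterable of (inputs, targets) pairs works as the dataloader here.
fake_loader = [(torch.randn(16, 128).cuda(), torch.randint(0, 10, (16,)).cuda())
               for _ in range(8)]
print("accumulated loss:", trainer.train_step(fake_loader))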
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 4
  queue: default
  priorityClassName: high-priority
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: distributed-training
    spec:
      schedulerName: volcano
      containers:
        - name: trainer
          image: pytorch-training:latest
          resources:
            limits:
              nvidia.com/gpu: 2
// Custom bin packing scheduler
type BinPackingScheduler struct {
    client.Client
}

func (b *BinPackingScheduler) Score(ctx context.Context, pod *v1.Pod, node *v1.Node) (int64, error) {
    // Calculate current GPU utilization
    nodeMetrics := b.getNodeGPUMetrics(node.Name)

    // Calculate how well this pod would pack
    requiredGPU := b.getGPURequirement(pod)
    currentUtilization := nodeMetrics.GPUUtilization
    newUtilization := currentUtilization + requiredGPU

    // Score higher for better packing
    if newUtilization > 0.7 && newUtilization < 0.95 {
        return 100, nil // Optimal packing
    } else if newUtilization < 0.7 {
        return 50, nil // Under-utilized
    } else {
        return 10, nil // Over-subscribed risk
    }
}
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "GPU Utilization Deep Dive",
        "panels": [
          {
            "title": "GPU Compute Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [{
              "expr": "avg by (gpu, node) (DCGM_FI_DEV_GPU_UTIL)"
            }]
          },
          {
            "title": "GPU Memory Utilization",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [{
              "expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100"
            }]
          },
          {
            "title": "Underutilized GPUs",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_GPU_UTIL < 30"
            }]
          },
          {
            "title": "GPU Power Efficiency",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
            "targets": [{
              "expr": "DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_GPU_UTIL"
            }]
          }
        ]
      }
    }
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
    - name: gpu.rules
      interval: 30s
      rules:
        - alert: GPUUnderutilized
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
          for: 10m
          annotations:
            summary: "GPU {{ $labels.gpu }} underutilized"
            description: "GPU utilization below 30% for 10 minutes"
        - alert: GPUMemoryLeak
          expr: rate(DCGM_FI_DEV_FB_USED[1h]) > 0.1
          for: 30m
          annotations:
            summary: "Potential GPU memory leak"
            description: "GPU memory usage increasing continuously"
        - alert: GPUTemperatureHigh
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          annotations:
            summary: "GPU temperature critical"
            description: "GPU {{ $labels.gpu }} temperature above 85°C"
A machine learning platform with:
- 50 A100 GPUs across 10 nodes
- Average utilization: 30%
- Monthly cost: $125,000
- Primary workload: Model training and inference
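Before changing anything, I quantify the gap. A rough back-of-the-envelope calculation (assuming the $125,000/month covers the full fleet and that useful work scales linearly with utilization) shows what is at stake:

# Back-of-the-envelope cost model for the cluster above.
gpus = 50
monthly_cost = 125_000        # USD for the whole fleet
current_util = 0.30
target_util = 0.85

useful_gpu_equivalents = gpus * current_util                    # ~15 GPUs of real work
gpus_needed_at_target = useful_gpu_equivalents / target_util    # ~17.6 GPUs
cost_per_gpu = monthly_cost / gpus                              # $2,500 per GPU per month
potential_monthly_cost = gpus_needed_at_target * cost_per_gpu   # ~$44,000
print(f"Potential monthly savings: ${monthly_cost - potential_monthly_cost:,.0f}")
# -> roughly $81,000/month, close to $1M/year, for the same amount of useful work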
# Essential metrics to track
- GPU compute utilization (target: >80%)
- GPU memory utilization (target: >70%)
- Power efficiency (performance per watt)
- Queue depth (pending GPU requests)
- Time to allocation (scheduling latency)
Achieving 85%+ GPU utilization requires a multi-faceted approach combining hardware features (MIG), software optimization (batching), and intelligent scheduling. The techniques I’ve outlined here have consistently delivered 2-3x improvements in GPU efficiency across diverse ML workloads.
The key is to start with comprehensive monitoring, understand your workload patterns, and apply the appropriate optimization strategies. With GPU costs continuing to rise, the ROI on utilization optimization often exceeds that of any other infrastructure investment.
Remember: every percentage point of improved GPU utilization directly impacts your bottom line. In the era of $30,000 H100s, the difference between 30% and 85% utilization is the difference between competitive advantage and bankruptcy.
Want to optimize your GPU infrastructure? I provide consulting services to help organizations maximize their GPU investments. Contact me to discuss your utilization challenges.