Building Production-Ready AI/ML Infrastructure on Kubernetes
Introduction: The GPU Orchestration Challenge
As organizations race to deploy AI workloads, they’re discovering that traditional Kubernetes knowledge isn’t enough. GPU scheduling, model serving at scale, and distributed training require specialized infrastructure expertise that bridges the gap between ML engineers and platform teams.
The challenges are multifaceted: GPU resources are expensive and often underutilized, model serving requires sub-second latency at scale, training jobs need fault tolerance for spot instances, and the entire platform must support rapid experimentation while maintaining production stability.
After implementing ML platforms for multiple enterprises, from financial services running fraud detection models to healthcare organizations processing medical imaging, I’ve developed battle-tested patterns for production ML infrastructure.
GPU Resource Management in Kubernetes
Understanding GPU Scheduling
Unlike CPU and memory, GPUs in Kubernetes are treated as extended resources that cannot be oversubscribed. This fundamental difference requires careful planning:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1   # requesting 1 GPU

The NVIDIA GPU Operator Approach
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU provisioning:
# Install GPU Operator via Helm
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true

Multi-Instance GPU (MIG) Configuration
MIG allows you to partition A100 and A30 GPUs into smaller instances, perfect for inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
      - devices: [0,1,2,3]
        mig-enabled: true
        mig-devices:
          1g.5gb: 7
This configuration creates 7 MIG instances per GPU, each with 5 GB of memory, which is ideal for serving smaller models.
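Once MIG is enabled, workloads request a slice instead of a whole GPU. A minimal sketch, assuming the operator's MIG strategy is set to mixed, which exposes each profile as its own extended resource (with the single strategy, slices are still requested as nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference
spec:
  containers:
  - name: predictor
    image: model-server:latest        # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one MIG slice instead of a full GPU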
Dynamic GPU Allocation with Node Labels
I implement sophisticated node labeling for workload placement:
// Go operator code for dynamic GPU labeling
func (r *GPUNodeReconciler) labelGPUNodes(ctx context.Context) error {
	nodes := &corev1.NodeList{}
	if err := r.List(ctx, nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		gpuCount := node.Status.Capacity["nvidia.com/gpu"]
		if gpuCount.IsZero() {
			continue
		}
		// Add labels based on GPU type; detectGPUClass, detectGPUMemory and
		// categorizeWorkload are helpers that inspect the node's GPU product info.
		labels := map[string]string{
			"gpu.nvidia.com/class":  detectGPUClass(node),
			"gpu.nvidia.com/memory": detectGPUMemory(node),
			"workload.gpu/type":     categorizeWorkload(node),
		}
		// Merge rather than replace so existing node labels are preserved
		for k, v := range labels {
			node.Labels[k] = v
		}
		if err := r.Update(ctx, node); err != nil {
			return err
		}
	}
	return nil
}
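Workloads can then target a GPU class with a plain nodeSelector; a minimal sketch (the label values a100 and inference are assumptions about what detectGPUClass and categorizeWorkload emit):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: large-model-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: large-model-inference
  template:
    metadata:
      labels:
        app: large-model-inference
    spec:
      nodeSelector:
        gpu.nvidia.com/class: a100      # assumed value emitted by detectGPUClass
        workload.gpu/type: inference    # assumed value emitted by categorizeWorkload
      containers:
      - name: server
        image: model-server:latest      # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1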
Model Serving at Scale with KServe
KServe Architecture
KServe provides a serverless framework for model serving with features like autoscaling, canary rollouts, and multi-model serving:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection-model
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80        # GPU utilization target
    scaleMetric: gpu
    # Alternative to the custom container below: a built-in Triton runtime predictor
    # triton:
    #   runtimeVersion: 23.08-py3
    #   storageUri: gs://model-store/fraud-detection
    containers:
    - name: kserve-container
      image: fraud-model:v2.1
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 8Gi
        limits:
          nvidia.com/gpu: 1
          memory: 8Gi
      env:
      - name: STORAGE_URI
        value: gs://models/fraud-detection/v2.1
      - name: MODEL_NAME
        value: fraud_detection
  transformer:
    containers:
    - name: transformer
      image: custom-transformer:latest
      env:
      - name: FEATURE_STORE_URI
        value: redis://feature-store:6379

Implementing A/B Testing for Models
A/B testing is crucial for safe model rollouts:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    # 20% of traffic goes to the newly applied revision (rec-model:v2.0);
    # the previously promoted revision (rec-model:v1.0) keeps the other 80%
    canaryTrafficPercent: 20
    containers:
    - name: kserve-container
      image: rec-model:v2.0
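Once the canary metrics look healthy, promotion is just another spec update; a minimal sketch (setting canaryTrafficPercent to 100 shifts all traffic to the latest revision, after which the field can be removed):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    canaryTrafficPercent: 100   # promote: route all traffic to the latest revision
    containers:
    - name: kserve-container
      image: rec-model:v2.0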
Optimizing Inference Performance
For high-throughput inference, I implement batching and caching strategies:
// Custom Go transformer for request batching.
// Request is the application's inference request type.
type BatchTransformer struct {
	batchSize     int
	batchTimeout  time.Duration
	requestQueue  chan *Request
	modelEndpoint string
}

func (bt *BatchTransformer) Process(ctx context.Context) {
	batch := make([]*Request, 0, bt.batchSize)
	timer := time.NewTimer(bt.batchTimeout)
	for {
		select {
		case req := <-bt.requestQueue:
			batch = append(batch, req)
			if len(batch) >= bt.batchSize {
				bt.sendBatch(batch)
				// Allocate a fresh slice so sendBatch keeps a stable copy
				batch = make([]*Request, 0, bt.batchSize)
				timer.Reset(bt.batchTimeout)
			}
		case <-timer.C:
			if len(batch) > 0 {
				bt.sendBatch(batch)
				batch = make([]*Request, 0, bt.batchSize)
			}
			timer.Reset(bt.batchTimeout)
		case <-ctx.Done():
			return
		}
	}
}
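If maintaining a custom transformer is too much overhead, KServe also has a built-in batcher that can be enabled per component; a minimal sketch on the fraud-detection predictor (batch size and latency values are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection-model
spec:
  predictor:
    batcher:
      maxBatchSize: 32   # flush once 32 requests are queued
      maxLatency: 500    # ...or after 500 ms, whichever comes first
    containers:
    - name: kserve-container
      image: fraud-model:v2.1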
Distributed Training with Kubeflow
Setting Up Kubeflow Pipelines
Kubeflow Pipelines provide a platform for building and deploying ML workflows:
# pipeline.py - Distributed training pipeline
import kfp
from kfp import dsl
from kfp.v2 import compiler


@dsl.component(
    base_image='pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime',
    packages_to_install=['transformers']  # torch.distributed ships with torch itself
)
def distributed_training_op(
    model_name: str,
    dataset_path: str,
    num_epochs: int,
    num_gpus: int
) -> str:
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Initialize distributed training
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Model setup with DDP (load_model, train_epoch and save_checkpoint
    # are project-specific helpers)
    model = load_model(model_name)
    model = model.cuda(local_rank)
    model = DistributedDataParallel(model, device_ids=[local_rank])

    # Training loop
    for epoch in range(num_epochs):
        train_epoch(model, dataset_path, epoch)
        if local_rank == 0:
            save_checkpoint(model, epoch)

    return f"gs://models/{model_name}/final"


@dsl.pipeline(
    name='Distributed Training Pipeline',
    description='Multi-GPU distributed training pipeline'
)
def training_pipeline(
    model_name: str = 'bert-large',
    dataset_path: str = 'gs://datasets/custom',
    num_epochs: int = 10
):
    # Distributed training with 4 GPUs
    training_task = distributed_training_op(
        model_name=model_name,
        dataset_path=dataset_path,
        num_epochs=num_epochs,
        num_gpus=4
    ).set_gpu_limit(4).set_memory_limit('32G')

PyTorchJob for Distributed Training
For more control over distributed training, I use PyTorchJob:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetuning
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
            command:
            - python
            - train.py
            - --model=llama2-7b
            - --distributed
            - --checkpoint-freq=1000
            volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
          volumes:
          - name: checkpoint-storage
            persistentVolumeClaim:
              claimName: training-checkpoints
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
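The training-checkpoints claim mounted above has to be shared by the master and all workers; a minimal sketch of that PVC (storage class and size are assumptions, the RWX access mode is the important part):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
spec:
  accessModes:
  - ReadWriteMany              # master and workers all mount the same volume
  storageClassName: shared-nfs # placeholder: any RWX-capable storage class works
  resources:
    requests:
      storage: 500Gi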
Cost Optimization Strategies
Spot GPU Instance Management
I’ve developed a comprehensive spot instance strategy that reduces costs by 60-70%:
apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-handler-config
data:
  handler.sh: |
    #!/bin/bash
    # Spot instance termination handler (IMDSv2)
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    while true; do
      # Check for spot termination notice
      HTTP_CODE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action)
      if [ "$HTTP_CODE" == "200" ]; then
        echo "Spot instance termination notice received"
        # Ask the training process to checkpoint. This assumes train.py runs as
        # PID 1 and traps SIGUSR1 to save /checkpoints/emergency_checkpoint.pt
        # (model, optimizer, epoch and loss state) and upload it with
        # gsutil cp ... gs://checkpoints/spot-recovery/
        kubectl exec -n training $(kubectl get pods -n training -l job=current -o name) -- kill -USR1 1
        # Drain node
        kubectl drain $(hostname) --force --ignore-daemonsets
        break
      fi
      sleep 5
    done
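The ConfigMap only holds the script; I run it from a DaemonSet pinned to spot GPU nodes so every node watches its own metadata endpoint. A minimal sketch, with assumed image, node label, and service account names (host networking is used so $(hostname) and the IMDS endpoint resolve at the node level):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spot-termination-handler
  namespace: training
spec:
  selector:
    matchLabels:
      app: spot-termination-handler
  template:
    metadata:
      labels:
        app: spot-termination-handler
    spec:
      nodeSelector:
        node.kubernetes.io/lifecycle: spot   # assumed spot-node label
      hostNetwork: true                      # hostname and IMDS resolve to the node
      serviceAccountName: spot-handler       # assumed SA with pods/exec and drain RBAC
      tolerations:
      - operator: Exists                     # run even on tainted GPU nodes
      containers:
      - name: handler
        image: spot-handler:latest           # assumed image with bash, curl and kubectl
        command: ["/bin/bash", "/config/handler.sh"]
        volumeMounts:
        - name: handler-script
          mountPath: /config
      volumes:
      - name: handler-script
        configMap:
          name: spot-handler-config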
Automatic Checkpointing System
My custom operator ensures no work is lost when using spot instances:
// Checkpoint operator in Go
type CheckpointReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *CheckpointReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	job := &batchv1.Job{}
	if err := r.Get(ctx, req.NamespacedName, job); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Check if job is running on spot instance
	if isSpotInstance(job) {
		// Set up periodic checkpointing
		cronJob := &batchv1.CronJob{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-checkpoint", job.Name),
				Namespace: job.Namespace,
			},
			Spec: batchv1.CronJobSpec{
				Schedule: "*/10 * * * *", // Every 10 minutes
				JobTemplate: batchv1.JobTemplateSpec{
					Spec: batchv1.JobSpec{
						Template: corev1.PodTemplateSpec{
							Spec: corev1.PodSpec{
								RestartPolicy: corev1.RestartPolicyOnFailure,
								Containers: []corev1.Container{{
									Name:  "checkpointer",
									Image: "checkpoint-saver:latest",
									Command: []string{
										"/bin/sh", "-c",
										"kubectl exec " + job.Name + " -- /checkpoint.sh",
									},
								}},
							},
						},
					},
				},
			},
		}
		// Tolerate AlreadyExists so the reconcile loop stays idempotent
		if err := r.Create(ctx, cronJob); err != nil && !apierrors.IsAlreadyExists(err) {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
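Because the generated CronJob shells into the training pod with kubectl exec, its service account needs explicit RBAC; a minimal sketch of the Role it would bind to (names and namespace are assumptions):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: checkpoint-exec
  namespace: training
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]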
GPU Time-Slicing for Development
For development workloads, I implement GPU time-slicing to maximize utilization:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Create 4 virtual GPUs per physical GPU
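The ConfigMap only takes effect once the GPU Operator's device plugin is pointed at it. A sketch of the relevant ClusterPolicy fields (in practice I patch the operator's existing cluster-policy object rather than applying a full manifest):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: gpu-sharing-config   # the ConfigMap defined above
      default: any               # key within the ConfigMap to apply by default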
Monitoring and Observability
GPU Metrics with DCGM
I deploy NVIDIA DCGM for comprehensive GPU monitoring:
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app: nvidia-dcgm-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    port: metrics
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
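Beyond dashboards, I also alert on GPUs that sit idle so they can be reclaimed; a minimal sketch of a PrometheusRule built on the DCGM metrics above (threshold and duration are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 15
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "GPUs on {{ $labels.Hostname }} have averaged under 15% utilization for 2h"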
Custom Grafana Dashboard for ML Workloads
{
  "dashboard": {
    "title": "ML Infrastructure Overview",
    "panels": [
      {
        "title": "GPU Utilization by Model",
        "targets": [
          {
            "expr": "avg by (model_name) (DCGM_FI_DEV_GPU_UTIL{job=\"dcgm-exporter\"})"
          }
        ]
      },
      {
        "title": "Inference Latency P99",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Training Loss Over Time",
        "targets": [
          {
            "expr": "training_loss{job=\"kubeflow-metrics\"}"
          }
        ]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED{job=\"dcgm-exporter\"}"
          }
        ]
      }
    ]
  }
}

Production Best Practices
1. Resource Quotas for ML Namespaces
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-production
spec:
  hard:
    requests.nvidia.com/gpu: 10
    requests.memory: 500Gi
    persistentvolumeclaims: 20
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]

2. Priority Classes for ML Workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000
globalDefault: false
description: "Priority for production inference workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-experiments
value: 100
globalDefault: false
description: "Priority for training experiments"
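Workloads opt in by name via priorityClassName; a minimal sketch of a serving Deployment using the production class (image and labels are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-serving
  namespace: ml-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      priorityClassName: production-inference   # defined above
      containers:
      - name: model-server
        image: fraud-model:v2.1
        resources:
          limits:
            nvidia.com/gpu: 1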
description: "Priority for training experiments"3. Network Policies for Model Security
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-network-policy
  namespace: ml-production
spec:
  podSelector:
    matchLabels:
      app: model-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: feature-store
    ports:
    - protocol: TCP
      port: 6379

Case Study: Financial Services ML Platform
Challenge
A major financial institution needed to deploy 50+ fraud detection models with:
- Sub-100ms latency requirements
- 99.99% availability SLA
- Cost constraints of $500K annual budget
- Compliance with financial regulations
Solution Architecture
# Multi-region deployment for HA
apiVersion: v1
kind: Namespace
metadata:
  name: fraud-detection-prod
  labels:
    compliance: pci-dss
    environment: production
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model-ensemble
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 20
    scaleTarget: 70
    containers:
    - name: model-server
      image: fraud-ensemble:v3.2
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 16Gi
      env:
      - name: ENABLE_BATCHING
        value: "true"
      - name: MAX_BATCH_SIZE
        value: "32"
      - name: BATCH_TIMEOUT_MS
        value: "10"

Results
- Latency: Achieved 45ms P99 latency (55% better than requirement)
- Availability: 99.995% uptime over 6 months
- Cost: $380K annual run rate (24% under budget)
- Scale: Processing 10M+ transactions daily
Conclusion
Building production-ready ML infrastructure on Kubernetes requires deep understanding of both Kubernetes primitives and ML-specific requirements. The patterns and practices I’ve shared here - from GPU orchestration and model serving to cost optimization and monitoring - have been battle-tested across multiple production deployments.
Key takeaways for successful ML infrastructure:
- Start with GPU utilization - Measure and optimize from day one
- Implement checkpointing early - Essential for cost-effective spot instance usage
- Design for experimentation - ML teams need to iterate quickly
- Monitor everything - GPU metrics, model performance, and costs
- Automate operations - Use operators to handle routine tasks
The ML infrastructure space is evolving rapidly, but these foundational patterns provide a solid base for building scalable, cost-effective platforms that accelerate your organization’s AI initiatives.
Need help building your ML infrastructure? I offer consulting services for organizations looking to deploy production-grade ML platforms on Kubernetes. Schedule a consultation to discuss your specific requirements.