Building Production-Ready AI/ML Infrastructure on Kubernetes
Introduction: The GPU Orchestration Challenge
As organizations race to deploy AI workloads, they’re discovering that traditional Kubernetes knowledge isn’t enough. GPU scheduling, model serving at scale, and distributed training require specialized infrastructure expertise that bridges the gap between ML engineers and platform teams.
The challenges are multifaceted: GPU resources are expensive and often underutilized, model serving requires sub-second latency at scale, training jobs need fault tolerance for spot instances, and the entire platform must support rapid experimentation while maintaining production stability.
After implementing ML platforms for multiple enterprises, from financial services running fraud detection models to healthcare organizations processing medical imaging, I’ve developed battle-tested patterns for production ML infrastructure.
GPU Resource Management in Kubernetes
Understanding GPU Scheduling
Unlike CPU and memory, GPUs in Kubernetes are treated as extended resources that cannot be oversubscribed. This fundamental difference requires careful planning:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1   # requesting 1 GPU

The NVIDIA GPU Operator Approach
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU provisioning:
# Install GPU Operator via Helm
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true

Multi-Instance GPU (MIG) Configuration
MIG allows you to partition A100 and A30 GPUs into smaller instances, perfect for inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
      - devices: [0,1,2,3]
        mig-enabled: true
        mig-devices:
          1g.5gb: 7
This configuration creates 7 MIG instances per GPU, each with 5 GB of memory, which is ideal for serving smaller models.
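Once MIG is enabled, workloads request a slice instead of a whole GPU. A minimal sketch, assuming the operator's MIG strategy is set to mixed, which exposes each profile as its own extended resource (with the single strategy, slices are still requested as nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference
spec:
  containers:
  - name: predictor
    image: model-server:latest        # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one MIG slice instead of a full GPU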
Dynamic GPU Allocation with Node Labels
I implement sophisticated node labeling for workload placement:
// Go operator code for dynamic GPU labeling
func (r *GPUNodeReconciler) labelGPUNodes(ctx context.Context) error {
	nodes := &corev1.NodeList{}
	if err := r.List(ctx, nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		gpuCount := node.Status.Capacity["nvidia.com/gpu"]
		if gpuCount.IsZero() {
			continue
		}
		// Add labels based on GPU type; detectGPUClass, detectGPUMemory and
		// categorizeWorkload are helpers that inspect the node's GPU product info.
		labels := map[string]string{
			"gpu.nvidia.com/class":  detectGPUClass(node),
			"gpu.nvidia.com/memory": detectGPUMemory(node),
			"workload.gpu/type":     categorizeWorkload(node),
		}
		// Merge rather than replace so existing node labels are preserved
		for k, v := range labels {
			node.Labels[k] = v
		}
		if err := r.Update(ctx, node); err != nil {
			return err
		}
	}
	return nil
}
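Workloads can then target a GPU class with a plain nodeSelector; a minimal sketch (the label values a100 and inference are assumptions about what detectGPUClass and categorizeWorkload emit):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: large-model-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: large-model-inference
  template:
    metadata:
      labels:
        app: large-model-inference
    spec:
      nodeSelector:
        gpu.nvidia.com/class: a100      # assumed value emitted by detectGPUClass
        workload.gpu/type: inference    # assumed value emitted by categorizeWorkload
      containers:
      - name: server
        image: model-server:latest      # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1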
Model Serving at Scale with KServe
KServe Architecture
KServe provides a serverless framework for model serving with features like autoscaling, canary rollouts, and multi-model serving:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection-model
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80        # GPU utilization target
    scaleMetric: gpu
    # Alternative to the custom container below: a built-in Triton runtime predictor
    # triton:
    #   runtimeVersion: 23.08-py3
    #   storageUri: gs://model-store/fraud-detection
    containers:
    - name: kserve-container
      image: fraud-model:v2.1
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 8Gi
        limits:
          nvidia.com/gpu: 1
          memory: 8Gi
      env:
      - name: STORAGE_URI
        value: gs://models/fraud-detection/v2.1
      - name: MODEL_NAME
        value: fraud_detection
  transformer:
    containers:
    - name: transformer
      image: custom-transformer:latest
      env:
      - name: FEATURE_STORE_URI
        value: redis://feature-store:6379

Implementing A/B Testing for Models
A/B testing is crucial for safe model rollouts:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    # 20% of traffic goes to the newly applied revision (rec-model:v2.0);
    # the previously promoted revision (rec-model:v1.0) keeps the other 80%
    canaryTrafficPercent: 20
    containers:
    - name: kserve-container
      image: rec-model:v2.0
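Once the canary metrics look healthy, promotion is just another spec update; a minimal sketch (setting canaryTrafficPercent to 100 shifts all traffic to the latest revision, after which the field can be removed):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    canaryTrafficPercent: 100   # promote: route all traffic to the latest revision
    containers:
    - name: kserve-container
      image: rec-model:v2.0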
Optimizing Inference Performance
For high-throughput inference, I implement batching and caching strategies:
// Custom Go transformer for request batching.
// Request is the application's inference request type.
type BatchTransformer struct {
	batchSize     int
	batchTimeout  time.Duration
	requestQueue  chan *Request
	modelEndpoint string
}

func (bt *BatchTransformer) Process(ctx context.Context) {
	batch := make([]*Request, 0, bt.batchSize)
	timer := time.NewTimer(bt.batchTimeout)
	for {
		select {
		case req := <-bt.requestQueue:
			batch = append(batch, req)
			if len(batch) >= bt.batchSize {
				bt.sendBatch(batch)
				// Allocate a fresh slice so sendBatch keeps a stable copy
				batch = make([]*Request, 0, bt.batchSize)
				timer.Reset(bt.batchTimeout)
			}
		case <-timer.C:
			if len(batch) > 0 {
				bt.sendBatch(batch)
				batch = make([]*Request, 0, bt.batchSize)
			}
			timer.Reset(bt.batchTimeout)
		case <-ctx.Done():
			return
		}
	}
}
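If maintaining a custom transformer is too much overhead, KServe also has a built-in batcher that can be enabled per component; a minimal sketch on the fraud-detection predictor (batch size and latency values are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection-model
spec:
  predictor:
    batcher:
      maxBatchSize: 32   # flush once 32 requests are queued
      maxLatency: 500    # ...or after 500 ms, whichever comes first
    containers:
    - name: kserve-container
      image: fraud-model:v2.1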
Distributed Training with Kubeflow
Setting Up Kubeflow Pipelines
Kubeflow Pipelines provide a platform for building and deploying ML workflows:
# pipeline.py - Distributed training pipeline
import kfp
from kfp import dsl
from kfp.v2 import compiler


@dsl.component(
    base_image='pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime',
    packages_to_install=['transformers']  # torch.distributed ships with torch itself
)
def distributed_training_op(
    model_name: str,
    dataset_path: str,
    num_epochs: int,
    num_gpus: int
) -> str:
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Initialize distributed training
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Model setup with DDP (load_model, train_epoch and save_checkpoint
    # are project-specific helpers)
    model = load_model(model_name)
    model = model.cuda(local_rank)
    model = DistributedDataParallel(model, device_ids=[local_rank])

    # Training loop
    for epoch in range(num_epochs):
        train_epoch(model, dataset_path, epoch)
        if local_rank == 0:
            save_checkpoint(model, epoch)

    return f"gs://models/{model_name}/final"


@dsl.pipeline(
    name='Distributed Training Pipeline',
    description='Multi-GPU distributed training pipeline'
)
def training_pipeline(
    model_name: str = 'bert-large',
    dataset_path: str = 'gs://datasets/custom',
    num_epochs: int = 10
):
    # Distributed training with 4 GPUs
    training_task = distributed_training_op(
        model_name=model_name,
        dataset_path=dataset_path,
        num_epochs=num_epochs,
        num_gpus=4
    ).set_gpu_limit(4).set_memory_limit('32G')

PyTorchJob for Distributed Training
For more control over distributed training, I use PyTorchJob:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetuning
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
            command:
            - python
            - train.py
            - --model=llama2-7b
            - --distributed
            - --checkpoint-freq=1000
            volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
          volumes:
          - name: checkpoint-storage
            persistentVolumeClaim:
              claimName: training-checkpoints
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
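The training-checkpoints claim mounted above has to be shared by the master and all workers; a minimal sketch of that PVC (storage class and size are assumptions, the RWX access mode is the important part):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
spec:
  accessModes:
  - ReadWriteMany              # master and workers all mount the same volume
  storageClassName: shared-nfs # placeholder: any RWX-capable storage class works
  resources:
    requests:
      storage: 500Gi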
Cost Optimization Strategies
Spot GPU Instance Management
I’ve developed a comprehensive spot instance strategy that reduces costs by 60-70%:
apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-handler-config
data:
  handler.sh: |
    #!/bin/bash
    # Spot instance termination handler (IMDSv2)
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    while true; do
      # Check for spot termination notice
      HTTP_CODE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action)
      if [ "$HTTP_CODE" == "200" ]; then
        echo "Spot instance termination notice received"
        # Ask the training process to checkpoint. This assumes train.py runs as
        # PID 1 and traps SIGUSR1 to save /checkpoints/emergency_checkpoint.pt
        # (model, optimizer, epoch and loss state) and upload it with
        # gsutil cp ... gs://checkpoints/spot-recovery/
        kubectl exec -n training $(kubectl get pods -n training -l job=current -o name) -- kill -USR1 1
        # Drain node
        kubectl drain $(hostname) --force --ignore-daemonsets
        break
      fi
      sleep 5
    done
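The ConfigMap only holds the script; I run it from a DaemonSet pinned to spot GPU nodes so every node watches its own metadata endpoint. A minimal sketch, with assumed image, node label, and service account names (host networking is used so $(hostname) and the IMDS endpoint resolve at the node level):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spot-termination-handler
  namespace: training
spec:
  selector:
    matchLabels:
      app: spot-termination-handler
  template:
    metadata:
      labels:
        app: spot-termination-handler
    spec:
      nodeSelector:
        node.kubernetes.io/lifecycle: spot   # assumed spot-node label
      hostNetwork: true                      # hostname and IMDS resolve to the node
      serviceAccountName: spot-handler       # assumed SA with pods/exec and drain RBAC
      tolerations:
      - operator: Exists                     # run even on tainted GPU nodes
      containers:
      - name: handler
        image: spot-handler:latest           # assumed image with bash, curl and kubectl
        command: ["/bin/bash", "/config/handler.sh"]
        volumeMounts:
        - name: handler-script
          mountPath: /config
      volumes:
      - name: handler-script
        configMap:
          name: spot-handler-config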
Automatic Checkpointing System
My custom operator ensures no work is lost when using spot instances:
// Checkpoint operator in Go
type CheckpointReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *CheckpointReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	job := &batchv1.Job{}
	if err := r.Get(ctx, req.NamespacedName, job); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Check if job is running on spot instance
	if isSpotInstance(job) {
		// Set up periodic checkpointing
		cronJob := &batchv1.CronJob{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-checkpoint", job.Name),
				Namespace: job.Namespace,
			},
			Spec: batchv1.CronJobSpec{
				Schedule: "*/10 * * * *", // Every 10 minutes
				JobTemplate: batchv1.JobTemplateSpec{
					Spec: batchv1.JobSpec{
						Template: corev1.PodTemplateSpec{
							Spec: corev1.PodSpec{
								RestartPolicy: corev1.RestartPolicyOnFailure,
								Containers: []corev1.Container{{
									Name:  "checkpointer",
									Image: "checkpoint-saver:latest",
									Command: []string{
										"/bin/sh", "-c",
										"kubectl exec " + job.Name + " -- /checkpoint.sh",
									},
								}},
							},
						},
					},
				},
			},
		}
		// Tolerate AlreadyExists so the reconcile loop stays idempotent
		if err := r.Create(ctx, cronJob); err != nil && !apierrors.IsAlreadyExists(err) {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
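Because the generated CronJob shells into the training pod with kubectl exec, its service account needs explicit RBAC; a minimal sketch of the Role it would bind to (names and namespace are assumptions):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: checkpoint-exec
  namespace: training
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]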
GPU Time-Slicing for Development
For development workloads, I implement GPU time-slicing to maximize utilization:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Create 4 virtual GPUs per physical GPU
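The ConfigMap only takes effect once the GPU Operator's device plugin is pointed at it. A sketch of the relevant ClusterPolicy fields (in practice I patch the operator's existing cluster-policy object rather than applying a full manifest):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: gpu-sharing-config   # the ConfigMap defined above
      default: any               # key within the ConfigMap to apply by default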
Monitoring and Observability
GPU Metrics with DCGM
I deploy NVIDIA DCGM for comprehensive GPU monitoring:
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app: nvidia-dcgm-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    port: metrics
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
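Beyond dashboards, I also alert on GPUs that sit idle so they can be reclaimed; a minimal sketch of a PrometheusRule built on the DCGM metrics above (threshold and duration are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 15
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "GPUs on {{ $labels.Hostname }} have averaged under 15% utilization for 2h"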
Custom Grafana Dashboard for ML Workloads
{
  "dashboard": {
    "title": "ML Infrastructure Overview",
    "panels": [
      {
        "title": "GPU Utilization by Model",
        "targets": [
          {
            "expr": "avg by (model_name) (DCGM_FI_DEV_GPU_UTIL{job=\"dcgm-exporter\"})"
          }
        ]
      },
      {
        "title": "Inference Latency P99",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Training Loss Over Time",
        "targets": [
          {
            "expr": "training_loss{job=\"kubeflow-metrics\"}"
          }
        ]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED{job=\"dcgm-exporter\"}"
          }
        ]
      }
    ]
  }
}

Production Best Practices
1. Resource Quotas for ML Namespaces
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-production
spec:
  hard:
    requests.nvidia.com/gpu: 10
    requests.memory: 500Gi
    persistentvolumeclaims: 20
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]

2. Priority Classes for ML Workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000
globalDefault: false
description: "Priority for production inference workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-experiments
value: 100
globalDefault: false
description: "Priority for training experiments"
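Workloads opt in by name via priorityClassName; a minimal sketch of a serving Deployment using the production class (image and labels are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-serving
  namespace: ml-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      priorityClassName: production-inference   # defined above
      containers:
      - name: model-server
        image: fraud-model:v2.1
        resources:
          limits:
            nvidia.com/gpu: 1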
description: "Priority for training experiments"3. Network Policies for Model Security
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-network-policy
  namespace: ml-production
spec:
  podSelector:
    matchLabels:
      app: model-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: feature-store
    ports:
    - protocol: TCP
      port: 6379

Case Study: Financial Services ML Platform
Challenge
A major financial institution needed to deploy 50+ fraud detection models with:
- Sub-100ms latency requirements
- 99.99% availability SLA
- Cost constraints of $500K annual budget
- Compliance with financial regulations
Solution Architecture
# Multi-region deployment for HA
apiVersion: v1
kind: Namespace
metadata:
  name: fraud-detection-prod
  labels:
    compliance: pci-dss
    environment: production
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model-ensemble
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 20
    scaleTarget: 70
    containers:
    - name: model-server
      image: fraud-ensemble:v3.2
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 16Gi
      env:
      - name: ENABLE_BATCHING
        value: "true"
      - name: MAX_BATCH_SIZE
        value: "32"
      - name: BATCH_TIMEOUT_MS
        value: "10"

Results
- Latency: Achieved 45ms P99 latency (55% better than requirement)
- Availability: 99.995% uptime over 6 months
- Cost: $380K annual run rate (24% under budget)
- Scale: Processing 10M+ transactions daily
Conclusion
Building production-ready ML infrastructure on Kubernetes requires deep understanding of both Kubernetes primitives and ML-specific requirements. The patterns and practices I’ve shared here - from GPU orchestration and model serving to cost optimization and monitoring - have been battle-tested across multiple production deployments.
Key takeaways for successful ML infrastructure:
- Start with GPU utilization - Measure and optimize from day one
- Implement checkpointing early - Essential for cost-effective spot instance usage
- Design for experimentation - ML teams need to iterate quickly
- Monitor everything - GPU metrics, model performance, and costs
- Automate operations - Use operators to handle routine tasks
The ML infrastructure space is evolving rapidly, but these foundational patterns provide a solid base for building scalable, cost-effective platforms that accelerate your organization’s AI initiatives.
Need help building your ML infrastructure? I offer consulting services for organizations looking to deploy production-grade ML platforms on Kubernetes. Schedule a consultation to discuss your specific requirements.