As organizations race to deploy AI workloads, they’re discovering that traditional Kubernetes knowledge isn’t enough. GPU scheduling, model serving at scale, and distributed training require specialized infrastructure expertise that bridges the gap between ML engineers and platform teams.
The challenges are multifaceted: GPU resources are expensive and often underutilized, model serving requires sub-second latency at scale, training jobs need fault tolerance for spot instances, and the entire platform must support rapid experimentation while maintaining production stability.
After implementing ML platforms for multiple enterprises, from financial services running fraud detection models to healthcare organizations processing medical imaging, I’ve developed battle-tested patterns for production ML infrastructure.
Unlike CPU and memory, GPUs in Kubernetes are treated as extended resources that cannot be oversubscribed. This fundamental difference requires careful planning:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU provisioning:
# Install GPU Operator via Helm
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set migManager.enabled=true \
--set dcgmExporter.enabled=true
MIG allows you to partition A100 and A30 GPUs into smaller instances, perfect for inference workloads:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: [0,1,2,3]
          mig-enabled: true
          mig-devices:
            1g.5gb: 7
This configuration creates 7 MIG instances per GPU, each with 5GB of memory - perfect for serving smaller models.
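To schedule onto one of these slices, a pod requests the MIG resource directly. A minimal sketch, assuming the GPU Operator is configured with the mixed MIG strategy, which advertises each slice as nvidia.com/mig-1g.5gb (under the single strategy, slices appear as plain nvidia.com/gpu); the pod name and image are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference
spec:
  containers:
  - name: inference
    image: small-model-server:latest   # illustrative image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1       # one MIG slice, not a full GPU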
I implement sophisticated node labeling for workload placement:
// Go operator code for dynamic GPU labeling
func (r *GPUNodeReconciler) labelGPUNodes(ctx context.Context) error {
    nodes := &corev1.NodeList{}
    if err := r.List(ctx, nodes); err != nil {
        return err
    }

    for _, node := range nodes.Items {
        gpuCount := node.Status.Capacity["nvidia.com/gpu"]
        if gpuCount.IsZero() {
            continue
        }

        // Add labels based on GPU type, preserving any existing node labels
        if node.Labels == nil {
            node.Labels = map[string]string{}
        }
        node.Labels["gpu.nvidia.com/class"] = detectGPUClass(&node)
        node.Labels["gpu.nvidia.com/memory"] = detectGPUMemory(&node)
        node.Labels["workload.gpu/type"] = categorizeWorkload(&node)

        if err := r.Update(ctx, &node); err != nil {
            return err
        }
    }
    return nil
}
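Workloads can then target these labels with a nodeSelector or node affinity. A minimal sketch; the label values (a100, inference) and the image are assumptions for illustration and should match whatever detectGPUClass and categorizeWorkload emit in your environment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: latency-sensitive-inference
  template:
    metadata:
      labels:
        app: latency-sensitive-inference
    spec:
      nodeSelector:
        gpu.nvidia.com/class: a100     # assumed label value
        workload.gpu/type: inference   # assumed label value
      containers:
      - name: server
        image: inference-server:latest # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 1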
KServe provides a serverless framework for model serving with features like autoscaling, canary rollouts, and multi-model serving:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection-model
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80 # GPU utilization target
    scaleMetric: gpu
    containers:
    - name: kserve-container
      image: fraud-model:v2.1
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 8Gi
        limits:
          nvidia.com/gpu: 1
          memory: 8Gi
      env:
      - name: STORAGE_URI
        value: gs://models/fraud-detection/v2.1
      - name: MODEL_NAME
        value: fraud_detection
    # Alternatively, a built-in serving runtime can pull the model directly:
    #   runtimeVersion: 23.08-py3
    #   storageUri: gs://model-store/fraud-detection
  transformer:
    containers:
    - name: transformer
      image: custom-transformer:latest
      env:
      - name: FEATURE_STORE_URI
        value: redis://feature-store:6379
A/B testing is crucial for safe model rollouts:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    canaryTrafficPercent: 20
    containers:
    - name: model-a
      image: rec-model:v1.0
  canary:
    containers:
    - name: model-b
      image: rec-model:v2.0
For high-throughput inference, I implement batching and caching strategies:
// Custom Go transformer for request batching
type BatchTransformer struct {
    batchSize     int
    batchTimeout  time.Duration
    requestQueue  chan *Request
    modelEndpoint string
}

func (bt *BatchTransformer) Process(ctx context.Context) {
    batch := make([]*Request, 0, bt.batchSize)
    timer := time.NewTimer(bt.batchTimeout)

    for {
        select {
        case req := <-bt.requestQueue:
            batch = append(batch, req)
            if len(batch) >= bt.batchSize {
                bt.sendBatch(batch)
                batch = batch[:0]
                timer.Reset(bt.batchTimeout)
            }
        case <-timer.C:
            if len(batch) > 0 {
                bt.sendBatch(batch)
                batch = batch[:0]
            }
            timer.Reset(bt.batchTimeout)
        case <-ctx.Done():
            return
        }
    }
}
Kubeflow Pipelines provide a platform for building and deploying ML workflows:
# pipeline.py - Distributed training pipeline
import kfp
from kfp import dsl
from kfp.v2 import compiler

@dsl.component(
    base_image='pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime',
    packages_to_install=['torch-distributed', 'transformers']
)
def distributed_training_op(
    model_name: str,
    dataset_path: str,
    num_epochs: int,
    num_gpus: int
) -> str:
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Initialize distributed training
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Model setup with DDP
    model = load_model(model_name)
    model = model.cuda(local_rank)
    model = DistributedDataParallel(model, device_ids=[local_rank])

    # Training loop
    for epoch in range(num_epochs):
        train_epoch(model, dataset_path, epoch)
        if local_rank == 0:
            save_checkpoint(model, epoch)

    return f"gs://models/{model_name}/final"

@dsl.pipeline(
    name='Distributed Training Pipeline',
    description='Multi-GPU distributed training pipeline'
)
def training_pipeline(
    model_name: str = 'bert-large',
    dataset_path: str = 'gs://datasets/custom',
    num_epochs: int = 10
):
    # Distributed training with 4 GPUs
    training_task = distributed_training_op(
        model_name=model_name,
        dataset_path=dataset_path,
        num_epochs=num_epochs,
        num_gpus=4
    ).set_gpu_limit(4).set_memory_limit('32G')
For more control over distributed training, I use PyTorchJob:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetuning
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
            command:
            - python
            - train.py
            - --model=llama2-7b
            - --distributed
            - --checkpoint-freq=1000
            volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
          volumes:
          - name: checkpoint-storage
            persistentVolumeClaim:
              claimName: training-checkpoints
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: llm-training:latest
            resources:
              limits:
                nvidia.com/gpu: 8
              requests:
                nvidia.com/gpu: 8
                memory: 64Gi
I’ve developed a comprehensive spot instance strategy that reduces costs by 60-70%:
apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-handler-config
data:
  handler.sh: |
    #!/bin/bash
    # Spot instance termination handler
    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    while true; do
      # Check for spot termination notice
      HTTP_CODE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action)
      if [ "$HTTP_CODE" == "200" ]; then
        echo "Spot instance termination notice received"
        # Checkpoint current training state
        kubectl exec -n training $(kubectl get pods -n training -l job=current -o name) -- python -c "
    import torch
    import os
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    torch.save(checkpoint, '/checkpoints/emergency_checkpoint.pt')
    os.system('gsutil cp /checkpoints/emergency_checkpoint.pt gs://checkpoints/spot-recovery/')
    "
        # Drain the node
        kubectl drain $(hostname) --force --ignore-daemonsets
        break
      fi
      sleep 5
    done
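The handler only helps if it runs on every spot node, so it is typically shipped as a DaemonSet that mounts this ConfigMap. A hedged sketch, assuming spot nodes carry the EKS capacity-type label (swap the selector for your provisioner's equivalent), a spot-handler service account with exec/drain RBAC, and an image that bundles bash, curl, and kubectl:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spot-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: spot-termination-handler
  template:
    metadata:
      labels:
        app: spot-termination-handler
    spec:
      serviceAccountName: spot-handler        # assumed RBAC for exec/drain
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT  # assumed spot-node label
      containers:
      - name: handler
        image: spot-handler:latest            # illustrative image with bash, curl, kubectl
        command: ["/bin/bash", "/scripts/handler.sh"]
        volumeMounts:
        - name: handler-script
          mountPath: /scripts
      volumes:
      - name: handler-script
        configMap:
          name: spot-handler-config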
My custom operator ensures no work is lost when using spot instances:
// Checkpoint operator in Go
type CheckpointReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *CheckpointReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    job := &batchv1.Job{}
    if err := r.Get(ctx, req.NamespacedName, job); err != nil {
        return ctrl.Result{}, err
    }

    // Check if job is running on spot instances
    if isSpotInstance(job) {
        // Set up periodic checkpointing
        cronJob := &batchv1.CronJob{
            ObjectMeta: metav1.ObjectMeta{
                Name:      fmt.Sprintf("%s-checkpoint", job.Name),
                Namespace: job.Namespace,
            },
            Spec: batchv1.CronJobSpec{
                Schedule: "*/10 * * * *", // Every 10 minutes
                JobTemplate: batchv1.JobTemplateSpec{
                    Spec: batchv1.JobSpec{
                        Template: corev1.PodTemplateSpec{
                            Spec: corev1.PodSpec{
                                RestartPolicy: corev1.RestartPolicyOnFailure,
                                Containers: []corev1.Container{{
                                    Name:  "checkpointer",
                                    Image: "checkpoint-saver:latest",
                                    Command: []string{
                                        "/bin/sh", "-c",
                                        "kubectl exec " + job.Name + " -- /checkpoint.sh",
                                    },
                                }},
                            },
                        },
                    },
                },
            },
        }
        // Ignore AlreadyExists so repeated reconciles stay idempotent
        if err := r.Create(ctx, cronJob); err != nil && !apierrors.IsAlreadyExists(err) {
            return ctrl.Result{}, err
        }
    }

    return ctrl.Result{RequeueAfter: time.Minute}, nil
}
For development workloads, I implement GPU time-slicing to maximize utilization:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4 # Create 4 virtual GPUs per physical GPU
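With this in place, each physical GPU advertises four schedulable nvidia.com/gpu resources, so several development pods can share one card. A quick illustration; the deployment name and image are hypothetical:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-dev
spec:
  replicas: 4                      # all four replicas can land on one physical GPU
  selector:
    matchLabels:
      app: notebook-dev
  template:
    metadata:
      labels:
        app: notebook-dev
    spec:
      containers:
      - name: notebook
        image: dev-notebook:latest # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 1      # one time-sliced replica, not a dedicated GPU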
I deploy NVIDIA DCGM for comprehensive GPU monitoring:
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app: nvidia-dcgm-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    port: metrics
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
These metrics feed a Grafana dashboard that gives an at-a-glance view of the platform:

{
  "dashboard": {
    "title": "ML Infrastructure Overview",
    "panels": [
      {
        "title": "GPU Utilization by Model",
        "targets": [
          { "expr": "avg by (model_name) (DCGM_FI_DEV_GPU_UTIL{job=\"dcgm-exporter\"})" }
        ]
      },
      {
        "title": "Inference Latency P99",
        "targets": [
          { "expr": "histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Training Loss Over Time",
        "targets": [
          { "expr": "training_loss{job=\"kubeflow-metrics\"}" }
        ]
      },
      {
        "title": "GPU Memory Bandwidth Utilization",
        "targets": [
          { "expr": "DCGM_FI_DEV_MEM_COPY_UTIL{job=\"dcgm-exporter\"}" }
        ]
      }
    ]
  }
}
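Dashboards only help when someone is looking, so I pair them with alerts on the same metrics. A sketch of a PrometheusRule; the thresholds and the Hostname aggregation label are assumptions to tune for your environment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-platform-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: ml-platform
    rules:
    - alert: GPUUnderutilized
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}) < 20
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "GPU node under 20% utilization for 2 hours"
    - alert: InferenceLatencyHigh
      expr: histogram_quantile(0.99, rate(inference_duration_seconds_bucket[5m])) > 0.1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "P99 inference latency above 100ms for 10 minutes"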
Resource quotas keep each team's GPU and memory consumption within agreed limits:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-production
spec:
  hard:
    requests.nvidia.com/gpu: 10
    requests.memory: 500Gi
    persistentvolumeclaims: 20
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]
Priority classes ensure production inference preempts lower-priority training experiments when capacity is tight:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000
globalDefault: false
description: "Priority for production inference workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-experiments
value: 100
globalDefault: false
description: "Priority for training experiments"
Network policies lock model-serving pods down to ingress from the API gateway and egress to the feature store:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-network-policy
  namespace: ml-production
spec:
  podSelector:
    matchLabels:
      app: model-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: feature-store
    ports:
    - protocol: TCP
      port: 6379
A major financial institution needed to deploy 50+ fraud detection models with:

- Sub-100ms latency requirements
- A 99.99% availability SLA
- A $500K annual budget
- Compliance with financial regulations
# Multi-region deployment for HA
apiVersion: v1
kind: Namespace
metadata:
  name: fraud-detection-prod
  labels:
    compliance: pci-dss
    environment: production
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model-ensemble
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 20
    scaleTarget: 70
    containers:
    - name: model-server
      image: fraud-ensemble:v3.2
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 16Gi
      env:
      - name: ENABLE_BATCHING
        value: "true"
      - name: MAX_BATCH_SIZE
        value: "32"
      - name: BATCH_TIMEOUT_MS
        value: "10"
Building production-ready ML infrastructure on Kubernetes requires deep understanding of both Kubernetes primitives and ML-specific requirements. The patterns and practices I’ve shared here - from GPU orchestration and model serving to cost optimization and monitoring - have been battle-tested across multiple production deployments.
Key takeaways for successful ML infrastructure:

- Treat GPUs as scarce, non-oversubscribable resources and raise utilization with MIG partitioning and time-slicing.
- Serve models behind an autoscaling layer such as KServe, with canary rollouts and request batching to hold latency SLAs.
- Make training fault-tolerant with frequent checkpointing so spot instances deliver their 60-70% cost savings without losing work.
- Monitor GPU-level metrics with DCGM alongside model metrics, and enforce quotas, priorities, and network policies so experimentation doesn't destabilize production.
The ML infrastructure space is evolving rapidly, but these foundational patterns provide a solid base for building scalable, cost-effective platforms that accelerate your organization’s AI initiatives.
Need help building your ML infrastructure? I offer consulting services for organizations looking to deploy production-grade ML platforms on Kubernetes. Schedule a consultation to discuss your specific requirements.