About Me
Who I Am
I’m Jagadesh, a Senior Platform Engineer
specializing in AI/ML infrastructure on Kubernetes. Based in St. Louis,
Missouri, I help organizations build, optimize, and operate
production-grade ML platforms that deliver both technical excellence and
business value.
What sets me apart is the combination of deep Kubernetes
expertise and hands-on ML infrastructure experience. Unlike specialists
who focus solely on traditional workloads, I bring practical experience
with GPU orchestration and the challenges specific to ML workloads:
- Google Kubernetes Engine (GKE) with GPU node pools
- Amazon Elastic Kubernetes Service (EKS) with GPU instances
- Azure Kubernetes Service (AKS) with GPU-enabled clusters
- NVIDIA GPU Operator and MIG configuration
- Kubeflow and KServe implementations
- Custom Go operators for ML workflows
This cross-platform ML infrastructure perspective allows me to
recommend and implement the right solutions for your AI/ML needs,
optimized for both performance and cost.
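For instance, scheduling a training pod onto a GPU node pool rests on the same few primitives on every provider: an extended GPU resource, a node selector, and a toleration. Here is a minimal sketch in Go using client-go types; the GKE accelerator label value, image, and namespace are illustrative placeholders (EKS and AKS use their own node labels):

```go
// A minimal sketch of a Pod scheduled onto a GPU node pool.
// Label value, image, and namespace are illustrative placeholders.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func gpuPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "trainer", Namespace: "ml"},
		Spec: corev1.PodSpec{
			// Steer the pod onto GPU nodes; the label value is cluster-specific.
			NodeSelector: map[string]string{"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"},
			Containers: []corev1.Container{{
				Name:  "train",
				Image: "pytorch/pytorch:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					// Extended resource advertised by the NVIDIA device plugin.
					// With MIG in mixed mode this would instead be a slice
					// resource such as "nvidia.com/mig-1g.5gb".
					Limits: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
				},
			}},
			// GPU node pools are commonly tainted; tolerate the standard taint.
			Tolerations: []corev1.Toleration{{
				Key:      "nvidia.com/gpu",
				Operator: corev1.TolerationOpExists,
				Effect:   corev1.TaintEffectNoSchedule,
			}},
		},
	}
}

func main() { _ = gpuPod() }
```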
My Expertise
AI/ML Infrastructure
- GPU Orchestration: NVIDIA GPU Operator, MIG
configuration, and GPU scheduling optimization
- Model Serving Platforms: KServe, Seldon Core, and
Triton Inference Server deployments
- Training Infrastructure: Kubeflow pipelines,
distributed training with Horovod/PyTorch
- ML Workflow Automation: Custom operators for
experiment tracking and model lifecycle
- Cost Optimization: Spot GPU strategies reducing
training costs by 60-70%
Go Development for ML
- Go-based ML Operators: Building production-grade
operators for ML workflow automation
- Custom Controller Development: Extensive experience
with operator-sdk for ML platforms (a minimal reconciler sketch follows this list)
- Cloud-Native ML Services: High-performance Go
services for model serving and data pipelines
- Platform Tooling: GPU monitoring, cost tracking,
and utilization optimization tools
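To make that controller work concrete, here is a skeleton of a reconcile loop as controller-runtime structures it. The TrainingJob kind and its ml.example.com group are hypothetical stand-ins, not a published API; the reconcile body is only outlined:

```go
// Skeleton of a controller-runtime reconciler for a hypothetical
// TrainingJob custom resource. Group/kind are placeholders.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var trainingJobGVK = schema.GroupVersionKind{
	Group: "ml.example.com", Version: "v1alpha1", Kind: "TrainingJob",
}

type TrainingJobReconciler struct {
	client.Client
}

func (r *TrainingJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the custom resource; a not-found error means it was deleted.
	job := &unstructured.Unstructured{}
	job.SetGroupVersionKind(trainingJobGVK)
	if err := r.Get(ctx, req.NamespacedName, job); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Reconcile desired vs. actual state: e.g. create the underlying
	// batch Job with GPU requests, then update the resource's status.

	return ctrl.Result{}, nil
}

func (r *TrainingJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(trainingJobGVK)
	return ctrl.NewControllerManagedBy(mgr).For(obj).Complete(r)
}
```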
GPU Resource Management
- Utilization Optimization: Improving GPU utilization
from 30% to 85%+
- Multi-Instance GPU (MIG): Maximizing inference
throughput with GPU partitioning
- Time-Slicing: Enabling GPU sharing for development
workloads
- Spot Instance Management: Automated checkpointing
and recovery for training jobs (sketched after this list)
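The checkpoint-and-recover item above is, at its core, signal handling: the kubelet delivers SIGTERM when a spot node is reclaimed, and the job has its termination grace period to persist state. A minimal sketch, with saveCheckpoint as a hypothetical stand-in for real checkpoint logic:

```go
// Sketch of the spot-interruption pattern: checkpoint on SIGTERM
// before the pod's termination grace period expires.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func saveCheckpoint(path string) error {
	// In a real job: flush model weights and optimizer state to
	// durable storage (e.g. an object-store-backed volume).
	fmt.Println("checkpoint written to", path)
	return nil
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		// Training loop placeholder; long jobs should also
		// checkpoint periodically, not only on shutdown.
		close(done)
	}()

	select {
	case <-sigs:
		// Node reclaimed: persist state before the grace period
		// (terminationGracePeriodSeconds) runs out, then exit so
		// the job can be rescheduled and resume from the checkpoint.
		if err := saveCheckpoint("/ckpt/latest"); err != nil {
			os.Exit(1)
		}
	case <-done:
	}
}
```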
ML Platform Tools
- Kubeflow: End-to-end ML workflow orchestration and
pipeline management
- KServe/Seldon: Model serving at scale with
automatic scaling and A/B testing (see the sketch after this list)
- MLflow: Model registry, experiment tracking, and
versioning
- Ray on Kubernetes: Distributed training and
hyperparameter tuning
- Jupyter Enterprise Gateway: Scalable notebook
infrastructure
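As one concrete example from this stack, a KServe InferenceService can be created from Go with the dynamic client. The serving.kserve.io/v1beta1 group is KServe's published API; the service name, namespace, predictor flavor, and storage URI below are placeholders:

```go
// Sketch: creating a KServe InferenceService via the dynamic client.
// Name, namespace, and storageUri are illustrative placeholders.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // or load a kubeconfig out of cluster
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group: "serving.kserve.io", Version: "v1beta1", Resource: "inferenceservices",
	}

	isvc := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "serving.kserve.io/v1beta1",
		"kind":       "InferenceService",
		"metadata":   map[string]interface{}{"name": "sklearn-demo", "namespace": "ml"},
		"spec": map[string]interface{}{
			"predictor": map[string]interface{}{
				"sklearn": map[string]interface{}{
					"storageUri": "gs://example-bucket/model", // placeholder
				},
			},
		},
	}}

	if _, err := dyn.Resource(gvr).Namespace("ml").Create(context.TODO(), isvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```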
Multi-Cloud & Hybrid Kubernetes
- Distribution-agnostic architecture: ML solutions
that work across any Kubernetes distribution
- Multi-cloud ML deployments: Consistent ML platforms
across AWS, GCP, and Azure
- Hybrid cloud integration: Connecting on-premises GPU
resources with the cloud
- Production readiness: Enterprise-grade ML platforms
operational in days
- Migration strategies: Moving ML workloads between
platforms with minimal disruption
Cost Optimization for ML
- GPU resource efficiency: Right-sizing GPU instances
for specific model requirements
- Spot GPU strategies: 60-70% cost reduction with
intelligent checkpointing
- Inference optimization: Model quantization and
batching strategies
- Multi-tenant GPU sharing: Maximizing GPU
utilization across teams (a quota sketch follows this list)
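Multi-tenant sharing is usually enforced with a per-namespace ResourceQuota on the GPU extended resource; Kubernetes quota supports extended resources via the requests. prefix. A sketch, with the team namespace and the 8-GPU cap as illustrative values:

```go
// Sketch: a per-team GPU quota. Kubernetes quota supports extended
// resources with the "requests." prefix; the 8-GPU cap is illustrative.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func teamGPUQuota(namespace string) *corev1.ResourceQuota {
	return &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-quota", Namespace: namespace},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				// Caps the sum of GPU requests across the namespace.
				"requests.nvidia.com/gpu": resource.MustParse("8"),
			},
		},
	}
}

func main() { _ = teamGPUQuota("team-nlp") } // hypothetical team namespace
```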
Developer Experience
- ML pipeline automation: Self-service model training
and deployment
- Experiment tracking: Automated capture of metrics
and parameters
- Model versioning: GitOps for ML models and
configurations
- Development environments: GPU-enabled notebooks and
IDEs
Observability & Monitoring
- GPU metrics: Utilization, memory, and temperature
monitoring (a query sketch follows this list)
- Model performance: Latency, throughput, and
accuracy tracking
- Training metrics: Loss curves, learning rates, and
convergence monitoring
- Cost attribution: Per-team and per-model cost
tracking
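Most of those GPU signals come from NVIDIA's DCGM exporter and land in Prometheus, where DCGM_FI_DEV_GPU_UTIL is the per-GPU utilization gauge. A sketch of querying the cluster-wide average from Go; the Prometheus address is a placeholder:

```go
// Sketch: cluster-wide average GPU utilization from Prometheus via
// the DCGM exporter metric. The Prometheus address is a placeholder.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Average utilization across all GPUs over the last five minutes.
	result, warnings, err := prom.Query(ctx, "avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))", time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("avg GPU utilization (%):", result)
}
```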
Services Offered
I provide specialized consulting services tailored to your
organization’s ML infrastructure needs:
Strategic ML Infrastructure Assessment
- GPU utilization analysis
- ML platform architecture review
- Cost optimization opportunities
- Model serving strategy evaluation
Implementation
- Production-ready GPU clusters
- Kubeflow and KServe deployment
- Custom ML operator development
- Distributed training infrastructure
Optimization
- GPU utilization improvement
- Inference latency reduction
- Training cost optimization
- Model serving efficiency
Knowledge Transfer
- ML infrastructure best practices
- Team training on Kubernetes for ML
- Documentation development
- Operational runbooks for ML platforms
Technical Background
My technical foundation includes:
- Languages: Go, Python, Java, Bash
- ML Infrastructure: Kubeflow, KServe, Seldon,
MLflow, Ray
- GPU Technologies: CUDA, NVIDIA GPU Operator, MIG,
nvidia-smi
- Kubernetes Development: operator-sdk, kubebuilder,
controller-runtime, client-go
- Infrastructure as Code: Terraform, Ansible,
Pulumi
- Cloud Platforms: AWS (with SageMaker integration),
GCP (with Vertex AI), Azure (with AzureML)
- Containerization: Docker, containerd, NVIDIA
Container Toolkit
- Service Mesh: Istio, Linkerd (for model A/B
testing)
- Observability: Prometheus, Grafana, DCGM
Exporter
- CI/CD: GitHub Actions, GitLab CI, Tekton,
ArgoCD
Certifications
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Application Developer (CKAD)
- Google Cloud Professional Cloud Architect
- Google Cloud Professional ML Engineer (in progress)
- AWS Certified Solutions Architect
- NVIDIA Deep Learning Institute Certifications
My Approach
I believe that ML infrastructure should accelerate innovation, not
create bottlenecks. My approach emphasizes:
- ML-first thinking: Every infrastructure decision
supports model development and deployment velocity
- Cost-conscious scaling: Balancing GPU performance
with budget constraints
- Developer empowerment: Self-service platforms that
don’t compromise governance
- Production reliability: ML systems that meet
enterprise SLAs
- Continuous optimization: Regular reviews of GPU
utilization and costs
Featured ML Infrastructure Articles
The articles on this site showcase my expertise in ML infrastructure
engineering:
- Building Production-Ready AI/ML Infrastructure on Kubernetes - Complete guide to GPU orchestration, model serving, and distributed training
- From 30% to 85%: Optimizing GPU Utilization - Practical strategies for maximizing GPU efficiency in Kubernetes
- Multi-Cloud ML Platform Architecture - Building consistent ML infrastructure across cloud providers
- ML Infrastructure Cost Optimization - Reducing GPU costs by 65% with spot instances and smart scheduling
- Building ML Pipeline Operators in Go - Developing custom operators for ML workflow automation
- Zero to Production: ML Platform in One Day - Rapid deployment of enterprise-grade ML infrastructure
Recent ML Projects
GPU Cluster Optimization for LLM Training
- Configured multi-node distributed training for 7B-parameter
models
- Implemented automatic checkpointing for spot instance
interruptions
- Achieved 70% cost reduction while maintaining training
stability
High-Throughput Model Serving
- Deployed KServe with custom transformers for image
preprocessing
- Implemented model versioning with canary deployments
- Scaled to handle 100K+ inference requests per second
End-to-End MLOps Platform
- Built end-to-end ML pipeline with Kubeflow
- Integrated with existing CI/CD for model deployment
- Implemented comprehensive model governance and audit trails
Let’s Connect
I’m always interested in discussing challenging ML infrastructure
problems and innovative solutions. Whether you’re looking for consulting
assistance, planning your ML platform strategy, or simply want to
exchange ideas about GPU orchestration and model serving, I’d love to
hear from you.
Contact me at:
- Email: hello@jagadesh.dev
- LinkedIn: linkedin.com/in/egntuywbw001
- GitHub: github.com/jagstack
Looking for ML infrastructure expertise? Schedule a free 30-minute
consultation to discuss your GPU orchestration and model deployment
challenges.