Setting up a production-ready Kubernetes cluster remains one of the most challenging aspects of modern infrastructure management. Many organizations struggle with the transition from “Hello World” examples to a truly enterprise-grade environment. I’ve led dozens of enterprise Kubernetes implementations and seen the same pattern repeatedly: the journey from proof of concept to production drags on for months while teams relearn the same lessons.
This scenario is both common and unnecessary. With proper planning and a systematic approach, you can establish a production-ready Kubernetes environment in a single day. This isn’t about cutting corners—it’s about applying hard-earned experience to avoid common pitfalls.
In this article, I’ll share my battle-tested approach for rapidly deploying enterprise-grade Kubernetes across any infrastructure, refined through implementations spanning financial services, healthcare, and technology sectors.
Before provisioning any infrastructure, meticulous preparation is crucial. I use this checklist with every client:
Networking:
- [ ] IP address ranges for nodes, pods, and services
- [ ] Ingress/egress strategy
- [ ] Internal/external DNS strategy
- [ ] Load balancer configuration
- [ ] Network policy requirements

Compute:
- [ ] Node sizing for control plane and workers
- [ ] Local vs. cloud storage requirements
- [ ] Availability zones/regions
- [ ] Bare metal, virtual, or cloud instances
- [ ] CPU and memory quotas by environment

Security:
- [ ] Authentication approach (OIDC, SSO)
- [ ] Authorization model (RBAC approach)
- [ ] Network isolation requirements
- [ ] Image scanning strategy
- [ ] Secrets management approach

Monitoring:
- [ ] Metrics collection scope
- [ ] Alerting thresholds and escalation
- [ ] Visualization needs
- [ ] Log aggregation strategy
- [ ] Tracing requirements (if applicable)

Backup/Recovery:
- [ ] Recovery Time Objective (RTO)
- [ ] Recovery Point Objective (RPO)
- [ ] Backup schedule and retention
- [ ] DR testing approach

Maintenance:
- [ ] Upgrade strategy
- [ ] Maintenance window requirements
- [ ] Patching responsibility
- [ ] Certificate rotation plan

Workload Profile:
- [ ] Application resource requirements
- [ ] Stateful workload needs
- [ ] Application deployment strategy
- [ ] Service mesh requirements
- [ ] External service dependencies

Developer Experience:
- [ ] CI/CD integration approach
- [ ] Self-service capabilities
- [ ] Development environment strategy
- [ ] Access control model
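To keep the checklist answers actionable, I capture them as code as soon as they are agreed. One way to do that (the variable names below are illustrative, not a fixed schema; the EKS example later in this article hardcodes the same values for readability) is a tfvars file that later feeds the infrastructure code:
# production.tfvars - checklist decisions captured as code (illustrative)
cluster_name     = "production-eks"
cluster_version  = "1.26"
vpc_cidr         = "10.0.0.0/16"
azs              = ["us-west-2a", "us-west-2b", "us-west-2c"]
system_instance  = "m5.large"
app_instance     = "m5.xlarge"
enable_irsa      = true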
This preparation phase typically takes 2-4 hours but saves days of rework later.
With the preparation complete, we can move to implementation. This process is divided into four phases, each with clear deliverables.
I implement all infrastructure as code using Terraform. Here’s a simplified version of my standard template:
# main.tf - AWS EKS Example (similar modules exist for GKE, AKS, etc.)
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0"
name = "k8s-vpc"
cidr = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false
enable_dns_hostnames = true
public_subnet_tags = {
"kubernetes.io/role/elb" = "1"
}
private_subnet_tags = {
"kubernetes.io/role/internal-elb" = "1"
}
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 18.0"
cluster_name = "production-eks"
cluster_version = "1.26"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# Control plane logging
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
# OIDC Integration for service accounts
enable_irsa = true
# Node groups
eks_managed_node_groups = {
system = {
name = "system-node-group"
instance_types = ["m5.large"]
capacity_type = "ON_DEMAND"
min_size = 2
max_size = 4
desired_size = 2
# Ensure system workloads run on these nodes
labels = {
workload-type = "system"
}
taints = {
dedicated = {
key = "dedicated"
value = "system"
effect = "NO_SCHEDULE"
}
}
}
application = {
name = "app-node-group"
instance_types = ["m5.xlarge"]
capacity_type = "ON_DEMAND"
min_size = 3
max_size = 20
desired_size = 3
labels = {
workload-type = "application"
}
}
}
# Allow unrestricted node-to-node traffic within the node security group
node_security_group_additional_rules = {
ingress_self_all = {
description = "Node to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
self = true
}
}
# RBAC configuration
manage_aws_auth_configmap = true
aws_auth_roles = [
{
rolearn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/Admin"
username = "admin"
groups = ["system:masters"]
},
]
}
# Current AWS account identity (referenced in the aws_auth role mapping above)
data "aws_caller_identity" "current" {}
# Identity and access management
module "eks_admin_iam" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.3"
role_name = "eks-admin"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["kube-system:cluster-admin-sa"]
}
}
}
This infrastructure code accomplishes several key objectives:
1. Creates a properly configured VPC with subnets across three availability zones
2. Provisions an EKS cluster with current best practices
3. Sets up node groups with proper taints and labels for workload isolation
4. Configures logging, OIDC integration, and basic RBAC
5. Establishes security groups with appropriate access controls
For on-premises environments, I use similar templates with Terraform providers for VMware, OpenStack, or bare metal provisioning tools.
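As a sketch of what the on-premises node-provisioning layer can look like, here is a heavily simplified vSphere example. Every name, size, and the omitted clone/template wiring are assumptions to adapt to your environment, and the Kubernetes bootstrap itself (kubeadm, RKE, etc.) is layered on top of these VMs:
# main.tf - on-premises sketch using the vSphere provider (illustrative values)
variable "vsphere_user" {}
variable "vsphere_password" {}
variable "vsphere_server" {}

provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = var.vsphere_server
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "dc" {
  name = "dc-01"
}

data "vsphere_datastore" "ds" {
  name          = "datastore-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_compute_cluster" "cluster" {
  name          = "cluster-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "net" {
  name          = "k8s-nodes"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "k8s_worker" {
  count            = 3
  name             = "k8s-worker-${count.index}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 8
  memory           = 32768
  guest_id         = "ubuntu64Guest"

  network_interface {
    network_id = data.vsphere_network.net.id
  }

  disk {
    label = "disk0"
    size  = 100
  }

  # In practice, add a clone {} block here referencing a prepared node template image.
}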
Estimated time for this phase: 30-45 minutes, mostly waiting for infrastructure provisioning.
Once the infrastructure is ready, I deploy a GitOps pipeline for managing the cluster configuration. This approach provides several benefits:
- Declarative configuration management
- Audit trail for all changes
- Easy rollback capabilities
- Multi-cluster consistency
Here’s how I set up ArgoCD for GitOps:
# argocd-install/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: argocd
# argocd-install/kustomization.yaml
# Pulls in the HA installation manifest, recommended for production environments
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
patches:
# Configure RBAC integration with OIDC
- target:
kind: ConfigMap
name: argocd-cm
patch: |
- op: add
path: /data/oidc.config
value: |
name: OIDC
issuer: https://accounts.google.com
clientID: $OIDC_CLIENT_ID
clientSecret: $OIDC_CLIENT_SECRET
requestedScopes: ["openid", "profile", "email"]
Next, I create the GitOps repository structure:
k8s-infra/
├── clusters/
│ └── production/
│ ├── kustomization.yaml
│ └── cluster-config.yaml
├── infrastructure/
│ ├── cert-manager/
│ ├── external-dns/
│ ├── ingress-nginx/
│ ├── prometheus/
│ └── vault/
├── namespaces/
│ ├── development/
│ ├── staging/
│ └── production/
└── policies/
├── network-policies/
├── pod-security/
└── resource-quotas/
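As a concrete illustration, clusters/production/kustomization.yaml simply composes the pieces that apply to that cluster. The contents below are an assumption about layout rather than a prescription, and each referenced directory is expected to carry its own kustomization.yaml:
# clusters/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- cluster-config.yaml
- ../../infrastructure
- ../../namespaces/production
- ../../policies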
With the repository structure in place, I deploy ArgoCD:
# Apply the ArgoCD installation (the kustomize directory defined above)
kubectl apply -k argocd-install/
# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s
# Get initial admin password
ARGO_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD Initial Password: $ARGO_PASSWORD"
# Deploy ArgoCD Applications for infrastructure components
kubectl apply -f applications.yaml
The applications.yaml file contains the ArgoCD Applications that define what to deploy:
# applications.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: infrastructure
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/k8s-infra.git
targetRevision: HEAD
path: infrastructure
destination:
server: https://kubernetes.default.svc
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: policies
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/k8s-infra.git
targetRevision: HEAD
path: policies
destination:
server: https://kubernetes.default.svc
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
Estimated time for this phase: 15-20 minutes.
With the GitOps pipeline established, I deploy essential components for a production-ready environment. These components are defined in the infrastructure directory of our GitOps repository:
Cert-manager automates the issuance and renewal of TLS certificates:
# cert-manager/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
# cert-manager/letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: ops@company.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
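Beyond Ingress annotations, the same issuer can be consumed explicitly through a Certificate resource when a workload needs its own TLS secret. A minimal sketch (names and hostname are illustrative):
# example: certificate requested directly from the ClusterIssuer
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
  namespace: production
spec:
  secretName: app-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - app.example.com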
NGINX Ingress Controller for routing external traffic:
# ingress-nginx/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.0/deploy/static/provider/cloud/deploy.yaml
patchesStrategicMerge:
- ingress-config.yaml
# ingress-nginx/ingress-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ingress-nginx-controller
namespace: ingress-nginx
data:
use-proxy-protocol: "true"
use-forwarded-headers: "true"
proxy-body-size: "10m"
http-snippet: |
server {
listen 2443;
return 308 https://$host$request_uri;
}
External-DNS automates DNS record management:
# external-dns/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- rbac.yaml
# external-dns/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-dns
namespace: kube-system
spec:
strategy:
type: Recreate
selector:
matchLabels:
app: external-dns
template:
metadata:
labels:
app: external-dns
spec:
serviceAccountName: external-dns
containers:
- name: external-dns
image: registry.k8s.io/external-dns/external-dns:v0.13.5
args:
- --source=service
- --source=ingress
- --provider=aws
- --registry=txt
- --txt-owner-id=k8s-production
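Once this controller is running, any Service or Ingress that exposes a hostname gets its DNS record managed automatically. For example, annotating a LoadBalancer Service is enough (the hostname below is illustrative):
# example: service with an external-dns managed record
apiVersion: v1
kind: Service
metadata:
  name: demo
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/hostname: demo.example.com
spec:
  type: LoadBalancer
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080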
Prometheus and Grafana for comprehensive monitoring:
# prometheus/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
- namespace.yaml
- github.com/prometheus-operator/kube-prometheus//manifests/setup?ref=v0.12.0
- github.com/prometheus-operator/kube-prometheus//manifests?ref=v0.12.0
patchesStrategicMerge:
- prometheus-config.yaml
- grafana-config.yaml
# prometheus/grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
data:
k8s-system-overview.json: |
{
"title": "Kubernetes System Overview",
"uid": "k8s-system-overview",
"...": "..." # Dashboard definition }
Vault for secure secrets management:
# vault/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: vault
resources:
- namespace.yaml
- https://github.com/hashicorp/vault-helm/releases/download/v0.23.0/vault-helm-0.23.0.tgz
patchesStrategicMerge:
- vault-config.yaml
# vault/vault-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vault-config
namespace: vault
data:
config.json: |
{
"listener": {
"tcp": {
"address": "0.0.0.0:8200",
"tls_disable": true
}
},
"storage": {
"file": {
"path": "/vault/data"
}
},
"ui": true }
EFK (Elasticsearch, Fluentd, Kibana) stack for centralized logging:
# logging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: logging
resources:
- namespace.yaml
- elasticsearch.yaml
- fluentd.yaml
- kibana.yaml
# logging/fluentd.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
spec:
serviceAccountName: fluentd
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
OPA Gatekeeper for policy enforcement:
# policies/gatekeeper/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.11/deploy/gatekeeper.yaml
# policies/gatekeeper/require-labels.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
properties:
labels:
type: array
items:
  type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg, "details": {"missing_labels": missing}}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := sprintf("Missing required labels: %v", [missing])
}
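A ConstraintTemplate only defines the policy; it is enforced by creating a matching constraint. For example, a constraint requiring a team label on every Namespace could look like this (the label name is illustrative):
# policies/gatekeeper/require-team-label.yaml (illustrative)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["team"]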
Estimated time for this phase: 30-45 minutes.
With all components deployed, I run a comprehensive validation suite to ensure the environment is production-ready:
#!/bin/bash
# validation.sh
echo "=== Cluster Connectivity ==="
kubectl get nodes
if [ $? -ne 0 ]; then
echo "❌ Failed to connect to cluster"
exit 1
fi
echo "✅ Cluster connectivity verified"
echo "=== Control Plane Health ==="
# Note: on managed control planes (EKS, GKE, AKS) these pods are not visible; skip or adapt this check there.
for component in kube-apiserver kube-controller-manager kube-scheduler etcd; do
echo "Checking $component..."
READY=$(kubectl get pods -n kube-system -l component=$component -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
if [[ $READY != *"True"* ]]; then
echo "❌ $component not ready"
exit 1
fi
done
echo "✅ Control plane health verified"
echo "=== ArgoCD Status ==="
ARGO_READY=$(kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
if [[ $ARGO_READY != *"True"* ]]; then
echo "❌ ArgoCD not ready"
exit 1
fi
echo "✅ ArgoCD health verified"
echo "=== Infrastructure Components ==="
for namespace in cert-manager ingress-nginx monitoring vault logging; do
echo "Checking $namespace namespace..."
kubectl get pods -n $namespace
if [ $? -ne 0 ]; then
echo "⚠️ Issues found in $namespace namespace"
fi
done
echo "=== Running Synthetic Tests ==="
# Create test deployment
kubectl create namespace test
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment
namespace: test
spec:
replicas: 1
selector:
matchLabels:
app: test
template:
metadata:
labels:
app: test
spec:
containers:
- name: nginx
image: nginx:stable
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: test-service
namespace: test
spec:
selector:
app: test
ports:
- port: 80
targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
namespace: test
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- test.example.com
secretName: test-tls
rules:
- host: test.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: test-service
port:
number: 80
EOF
echo "Waiting for test deployment to be ready..."
kubectl wait --for=condition=available deployment/test-deployment -n test --timeout=120s
echo "=== Checking Monitoring ==="
PROMETHEUS_ENDPOINT=$(kubectl get svc -n monitoring prometheus-k8s -o jsonpath='{.spec.clusterIP}')
curl -s "http://$PROMETHEUS_ENDPOINT:9090/api/v1/targets" | grep "up=\"1\""
if [ $? -ne 0 ]; then
echo "⚠️ Some monitoring targets may be down"
else
echo "✅ Monitoring targets up"
fi
echo "=== Validation Complete ==="
echo "Clean up test namespace? (y/n)"
read cleanup
if [ "$cleanup" == "y" ]; then
kubectl delete namespace test
fi
I also perform a security assessment:
# Run kube-bench for CIS benchmark validation
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
name: kube-bench
namespace: default
spec:
template:
spec:
hostPID: true
containers:
- name: kube-bench
image: aquasec/kube-bench:latest
command: ["kube-bench"]
volumeMounts:
- name: var-lib-kubelet
mountPath: /var/lib/kubelet
- name: etc-systemd
mountPath: /etc/systemd
- name: etc-kubernetes
mountPath: /etc/kubernetes
restartPolicy: Never
volumes:
- name: var-lib-kubelet
hostPath:
path: /var/lib/kubelet
- name: etc-systemd
hostPath:
path: /etc/systemd
- name: etc-kubernetes
hostPath:
path: /etc/kubernetes
EOF
# Wait for job to complete
kubectl wait --for=condition=complete job/kube-bench --timeout=300s
# Get results
kubectl logs job/kube-bench
Estimated time for this phase: 20-30 minutes.
To accelerate future deployments, I maintain a GitHub repository with all the necessary scripts and templates:
https://github.com/jagadesh/k8s-enterprise-setup
This repository includes Terraform modules for AWS, Azure, GCP, and on-premises environments, GitOps configuration for ArgoCD and Flux, manifests for the core platform components, validation suites, and supporting documentation. It follows a structure that makes it easy to adapt to different environments:
k8s-enterprise-setup/
├── terraform/
│ ├── aws/
│ ├── azure/
│ ├── gcp/
│ └── on-premise/
├── gitops/
│ ├── argocd/
│ └── flux/
├── components/
│ ├── ingress/
│ ├── monitoring/
│ ├── logging/
│ └── security/
├── validation/
│ ├── health-checks/
│ ├── performance/
│ └── security/
└── docs/
├── architecture/
├── runbooks/
└── handover/
Through dozens of implementations, I’ve identified common pitfalls that delay production readiness:
Problem: IP range conflicts, service mesh issues, and cross-namespace communication problems.
Solution: Use a comprehensive IP address management (IPAM) strategy:
# network-planning.yaml
ClusterCIDR: 10.200.0.0/16 # Pods
ServiceCIDR: 10.201.0.0/16 # Services
# Node subnets by zone
Zone-A: 10.0.0.0/24
Zone-B: 10.0.1.0/24
Zone-C: 10.0.2.0/24
# Ensure non-overlapping ranges with other networks
Corporate-DataCenter: 10.50.0.0/16
Existing-VPCs: 10.100.0.0/16, 172.16.0.0/16
Problem: Under-provisioned nodes leading to resource contention or over-provisioned resources increasing costs.
Solution: Implement proper resource quotas from day one:
# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
namespace: production
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
pods: "20"
Coupled with default resource requests/limits:
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
memory: 512Mi
cpu: 500m
defaultRequest:
memory: 256Mi
cpu: 200m
type: Container
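Once both objects are applied, a quick check confirms the quota and the default limits are active in the namespace:
kubectl describe resourcequota compute-resources -n production
kubectl describe limitrange default-limits -n production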
Problem: Focus on initial deployment without planning for ongoing maintenance.
Solution: Implement automation for common operational tasks:
# maintenance/node-drainer.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: node-drainer
namespace: kube-system
spec:
schedule: "0 2 * * 0" # 2 AM every Sunday
jobTemplate:
spec:
template:
spec:
serviceAccountName: node-drainer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
for node in $(kubectl get nodes -l maintenance=weekly -o name); do
echo "Draining $node"
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
sleep 300
kubectl uncordon $node
done
restartPolicy: OnFailure
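The CronJob references a node-drainer service account that is not defined above; a minimal RBAC sketch for it might look like the following (verify the verbs against the kubectl version baked into the image):
# maintenance/node-drainer-rbac.yaml (illustrative)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-drainer
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-drainer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "patch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["apps"]
  resources: ["daemonsets", "statefulsets", "replicasets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-drainer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-drainer
subjects:
- kind: ServiceAccount
  name: node-drainer
  namespace: kube-system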
Problem: Implementing security controls after workloads are deployed, leading to resistance and compliance issues.
Solution: Deploy security controls from day one:
# security/pod-security-standards.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
name: prevent-privileged-containers
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
excludedNamespaces: ["kube-system"]
parameters: {}
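Note that K8sPSPPrivilegedContainer is not a built-in kind: it comes from a ConstraintTemplate in the Gatekeeper policy library, which has to be deployed alongside Gatekeeper itself. A kustomization entry along these lines pulls it in (the path reflects the gatekeeper-library layout at the time of writing and should be verified):
# security/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/open-policy-agent/gatekeeper-library/master/library/pod-security-policy/privileged-containers/template.yaml
- pod-security-standards.yaml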
Problem: No automated backup solution, leading to potential data loss.
Solution: Implement Velero for Kubernetes-native backup from day one:
# backup/velero-install.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: velero
resources:
- namespace.yaml
- https://github.com/vmware-tanzu/velero/releases/download/v1.10.0/velero-v1.10.0-linux-amd64.tar.gz
patchesStrategicMerge:
- velero-schedule.yaml
# backup/velero-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 1 * * *"
template:
includedNamespaces:
- default
- production
- monitoring
excludedResources:
- pods
- events
includedResources:
- "*"
ttl: 720h # 30 days
storageLocation: default
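Backups only prove their worth when restores are rehearsed. With the Velero CLI installed, a periodic restore drill is a couple of commands (the backup name below is illustrative; scheduled backups are named after the schedule plus a timestamp):
# List backups produced by the daily-backup schedule
velero backup get
# Restore a chosen backup into the cluster
velero restore create --from-backup daily-backup-20240101010000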
Setting up an enterprise-grade Kubernetes environment doesn’t need to be a months-long journey. With proper planning, automation, and a systematic approach, you can deploy a production-ready platform in a single day.
The key elements of success are thorough preparation, infrastructure as code, a GitOps-driven configuration pipeline, a carefully chosen set of core platform components, and rigorous validation before anything goes live.
This approach has consistently delivered reliable, secure, and scalable Kubernetes environments for organizations ranging from startups to large enterprises. By following these guidelines, you can significantly accelerate your journey to production while establishing a solid foundation for future growth.
Remember that a successful Kubernetes implementation isn’t just about the initial deployment—it’s about building a platform that enables your organization to deploy and manage applications confidently, securely, and efficiently over the long term.
Looking for assistance with your Kubernetes implementation? Feel free to reach out to discuss your specific requirements and challenges. I offer consulting services tailored to your organization’s needs, from initial planning through deployment and ongoing operations.