Setting up a production-ready Kubernetes cluster remains one of the most challenging aspects of modern infrastructure management. Many organizations struggle with the transition from “Hello World” examples to a truly enterprise-grade environment. I’ve led dozens of enterprise Kubernetes implementations and seen the same pattern repeatedly: the journey from proof of concept to production drags on for months while teams relearn the same lessons.
This scenario is both common and unnecessary. With proper planning and a systematic approach, you can establish a production-ready Kubernetes environment in a single day. This isn’t about cutting corners—it’s about applying hard-earned experience to avoid common pitfalls.
In this article, I’ll share my battle-tested approach for rapidly deploying enterprise-grade Kubernetes across any infrastructure, refined through implementations spanning financial services, healthcare, and technology sectors.
Before provisioning any infrastructure, meticulous preparation is crucial. I use this checklist with every client:
Networking:
- [ ] IP address ranges for nodes, pods, and services
- [ ] Ingress/egress strategy
- [ ] Internal/external DNS strategy
- [ ] Load balancer configuration
- [ ] Network policy requirements

Compute:
- [ ] Node sizing for control plane and workers
- [ ] Local vs. cloud storage requirements
- [ ] Availability zones/regions
- [ ] Bare metal, virtual, or cloud instances
- [ ] CPU and memory quotas by environment

Security:
- [ ] Authentication approach (OIDC, SSO)
- [ ] Authorization model (RBAC approach)
- [ ] Network isolation requirements
- [ ] Image scanning strategy
- [ ] Secrets management approach

Monitoring:
- [ ] Metrics collection scope
- [ ] Alerting thresholds and escalation
- [ ] Visualization needs
- [ ] Log aggregation strategy
- [ ] Tracing requirements (if applicable)

Backup/Recovery:
- [ ] Recovery Time Objective (RTO)
- [ ] Recovery Point Objective (RPO)
- [ ] Backup schedule and retention
- [ ] DR testing approach

Maintenance:
- [ ] Upgrade strategy
- [ ] Maintenance window requirements
- [ ] Patching responsibility
- [ ] Certificate rotation plan

Workload Profile:
- [ ] Application resource requirements
- [ ] Stateful workload needs
- [ ] Application deployment strategy
- [ ] Service mesh requirements
- [ ] External service dependencies

Developer Experience:
- [ ] CI/CD integration approach
- [ ] Self-service capabilities
- [ ] Development environment strategy
- [ ] Access control model
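To keep the checklist answers actionable, I capture them as code as soon as they are agreed. One way to do that (the variable names below are illustrative, not a fixed schema; the EKS example later in this article hardcodes the same values for readability) is a tfvars file that later feeds the infrastructure code:
# production.tfvars - checklist decisions captured as code (illustrative)
cluster_name     = "production-eks"
cluster_version  = "1.26"
vpc_cidr         = "10.0.0.0/16"
azs              = ["us-west-2a", "us-west-2b", "us-west-2c"]
system_instance  = "m5.large"
app_instance     = "m5.xlarge"
enable_irsa      = true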
This preparation phase typically takes 2-4 hours but saves days of rework later.
With the preparation complete, we can move to implementation. This process is divided into four phases, each with clear deliverables.
I implement all infrastructure as code using Terraform. Here’s a simplified version of my standard template:
# main.tf - AWS EKS Example (similar modules exist for GKE, AKS, etc.)
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0"
name = "k8s-vpc"
cidr = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false
enable_dns_hostnames = true
public_subnet_tags = {
"kubernetes.io/role/elb" = "1"
}
private_subnet_tags = {
"kubernetes.io/role/internal-elb" = "1"
}
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 18.0"
cluster_name = "production-eks"
cluster_version = "1.26"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# Control plane logging
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
# OIDC Integration for service accounts
enable_irsa = true
# Node groups
eks_managed_node_groups = {
system = {
name = "system-node-group"
instance_types = ["m5.large"]
capacity_type = "ON_DEMAND"
min_size = 2
max_size = 4
desired_size = 2
# Ensure system workloads run on these nodes
labels = {
workload-type = "system"
}
taints = {
dedicated = {
key = "dedicated"
value = "system"
effect = "NO_SCHEDULE"
}
}
}
application = {
name = "app-node-group"
instance_types = ["m5.xlarge"]
capacity_type = "ON_DEMAND"
min_size = 3
max_size = 20
desired_size = 3
labels = {
workload-type = "application"
}
}
}
# Allow unrestricted node-to-node traffic within the node security group
node_security_group_additional_rules = {
ingress_self_all = {
description = "Node to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
self = true
}
}
# RBAC configuration
manage_aws_auth_configmap = true
aws_auth_roles = [
{
rolearn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/Admin"
username = "admin"
groups = ["system:masters"]
},
]
}
# Current AWS account identity (referenced in the aws_auth role mapping above)
data "aws_caller_identity" "current" {}
# Identity and access management
module "eks_admin_iam" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.3"
role_name = "eks-admin"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["kube-system:cluster-admin-sa"]
}
}
}
This infrastructure code accomplishes several key objectives:
1. Creates a properly configured VPC with subnets across three availability zones
2. Provisions an EKS cluster with current best practices
3. Sets up node groups with proper taints and labels for workload isolation
4. Configures logging, OIDC integration, and basic RBAC
5. Establishes security groups with appropriate access controls
For on-premises environments, I use similar templates with Terraform providers for VMware, OpenStack, or bare metal provisioning tools.
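As a sketch of what the on-premises node-provisioning layer can look like, here is a heavily simplified vSphere example. Every name, size, and the omitted clone/template wiring are assumptions to adapt to your environment, and the Kubernetes bootstrap itself (kubeadm, RKE, etc.) is layered on top of these VMs:
# main.tf - on-premises sketch using the vSphere provider (illustrative values)
variable "vsphere_user" {}
variable "vsphere_password" {}
variable "vsphere_server" {}

provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = var.vsphere_server
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "dc" {
  name = "dc-01"
}

data "vsphere_datastore" "ds" {
  name          = "datastore-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_compute_cluster" "cluster" {
  name          = "cluster-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "net" {
  name          = "k8s-nodes"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "k8s_worker" {
  count            = 3
  name             = "k8s-worker-${count.index}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 8
  memory           = 32768
  guest_id         = "ubuntu64Guest"

  network_interface {
    network_id = data.vsphere_network.net.id
  }

  disk {
    label = "disk0"
    size  = 100
  }

  # In practice, add a clone {} block here referencing a prepared node template image.
}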
Estimated time for this phase: 30-45 minutes, mostly waiting for infrastructure provisioning.
Once the infrastructure is ready, I deploy a GitOps pipeline for managing the cluster configuration. This approach provides several benefits:
- Declarative configuration management
- Audit trail for all changes
- Easy rollback capabilities
- Multi-cluster consistency
Here’s how I set up ArgoCD for GitOps:
# argocd-install/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: argocd
# argocd-install/kustomization.yaml
# Pulls in the HA installation manifest, recommended for production environments
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
patches:
# Configure RBAC integration with OIDC
- target:
kind: ConfigMap
name: argocd-cm
patch: |
- op: add
path: /data/oidc.config
value: |
name: OIDC
issuer: https://accounts.google.com
clientID: $OIDC_CLIENT_ID
clientSecret: $OIDC_CLIENT_SECRET
requestedScopes: ["openid", "profile", "email"]
Next, I create the GitOps repository structure:
k8s-infra/
├── clusters/
│ └── production/
│ ├── kustomization.yaml
│ └── cluster-config.yaml
├── infrastructure/
│ ├── cert-manager/
│ ├── external-dns/
│ ├── ingress-nginx/
│ ├── prometheus/
│ └── vault/
├── namespaces/
│ ├── development/
│ ├── staging/
│ └── production/
└── policies/
├── network-policies/
├── pod-security/
└── resource-quotas/
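As a concrete illustration, clusters/production/kustomization.yaml simply composes the pieces that apply to that cluster. The contents below are an assumption about layout rather than a prescription, and each referenced directory is expected to carry its own kustomization.yaml:
# clusters/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- cluster-config.yaml
- ../../infrastructure
- ../../namespaces/production
- ../../policies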
With the repository structure in place, I deploy ArgoCD:
# Apply the ArgoCD installation (the kustomize directory defined above)
kubectl apply -k argocd-install/
# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s
# Get initial admin password
ARGO_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD Initial Password: $ARGO_PASSWORD"
# Deploy ArgoCD Applications for infrastructure components
kubectl apply -f applications.yaml
The applications.yaml file contains the ArgoCD Applications that define what to deploy:
# applications.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: infrastructure
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/k8s-infra.git
targetRevision: HEAD
path: infrastructure
destination:
server: https://kubernetes.default.svc
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: policies
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/k8s-infra.git
targetRevision: HEAD
path: policies
destination:
server: https://kubernetes.default.svc
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
Estimated time for this phase: 15-20 minutes.
With the GitOps pipeline established, I deploy essential components for a production-ready environment. These components are defined in the infrastructure directory of our GitOps repository:
Cert-manager automates the issuance and renewal of TLS certificates:
# cert-manager/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
# cert-manager/letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: ops@company.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
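Beyond Ingress annotations, the same issuer can be consumed explicitly through a Certificate resource when a workload needs its own TLS secret. A minimal sketch (names and hostname are illustrative):
# example: certificate requested directly from the ClusterIssuer
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
  namespace: production
spec:
  secretName: app-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - app.example.com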
NGINX Ingress Controller for routing external traffic:
# ingress-nginx/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.0/deploy/static/provider/cloud/deploy.yaml
patchesStrategicMerge:
- ingress-config.yaml
# ingress-nginx/ingress-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ingress-nginx-controller
namespace: ingress-nginx
data:
use-proxy-protocol: "true"
use-forwarded-headers: "true"
proxy-body-size: "10m"
http-snippet: |
server {
listen 2443;
return 308 https://$host$request_uri;
}
External-DNS automates DNS record management:
# external-dns/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- rbac.yaml
# external-dns/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-dns
namespace: kube-system
spec:
strategy:
type: Recreate
selector:
matchLabels:
app: external-dns
template:
metadata:
labels:
app: external-dns
spec:
serviceAccountName: external-dns
containers:
- name: external-dns
image: registry.k8s.io/external-dns/external-dns:v0.13.5
args:
- --source=service
- --source=ingress
- --provider=aws
- --registry=txt
- --txt-owner-id=k8s-production
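Once this controller is running, any Service or Ingress that exposes a hostname gets its DNS record managed automatically. For example, annotating a LoadBalancer Service is enough (the hostname below is illustrative):
# example: service with an external-dns managed record
apiVersion: v1
kind: Service
metadata:
  name: demo
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/hostname: demo.example.com
spec:
  type: LoadBalancer
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080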
Prometheus and Grafana for comprehensive monitoring:
# prometheus/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
- namespace.yaml
- github.com/prometheus-operator/kube-prometheus//manifests/setup?ref=v0.12.0
- github.com/prometheus-operator/kube-prometheus//manifests?ref=v0.12.0
patchesStrategicMerge:
- prometheus-config.yaml
- grafana-config.yaml
# prometheus/grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
data:
k8s-system-overview.json: |
{
"title": "Kubernetes System Overview",
"uid": "k8s-system-overview",
"...": "..." # Dashboard definition }
Vault for secure secrets management:
# vault/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: vault
resources:
- namespace.yaml
- https://github.com/hashicorp/vault-helm/releases/download/v0.23.0/vault-helm-0.23.0.tgz
patchesStrategicMerge:
- vault-config.yaml
# vault/vault-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vault-config
namespace: vault
data:
config.json: |
{
"listener": {
"tcp": {
"address": "0.0.0.0:8200",
"tls_disable": true
}
},
"storage": {
"file": {
"path": "/vault/data"
}
},
"ui": true }
EFK (Elasticsearch, Fluentd, Kibana) stack for centralized logging:
# logging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: logging
resources:
- namespace.yaml
- elasticsearch.yaml
- fluentd.yaml
- kibana.yaml
# logging/fluentd.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
spec:
serviceAccountName: fluentd
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
OPA Gatekeeper for policy enforcement:
# policies/gatekeeper/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.11/deploy/gatekeeper.yaml
# policies/gatekeeper/require-labels.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
properties:
labels:
type: array
items:
  type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg, "details": {"missing_labels": missing}}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := sprintf("Missing required labels: %v", [missing])
}
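A ConstraintTemplate only defines the policy; it is enforced by creating a matching constraint. For example, a constraint requiring a team label on every Namespace could look like this (the label name is illustrative):
# policies/gatekeeper/require-team-label.yaml (illustrative)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["team"]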
Estimated time for this phase: 30-45 minutes.
With all components deployed, I run a comprehensive validation suite to ensure the environment is production-ready:
#!/bin/bash
# validation.sh
echo "=== Cluster Connectivity ==="
kubectl get nodes
if [ $? -ne 0 ]; then
echo "❌ Failed to connect to cluster"
exit 1
fi
echo "✅ Cluster connectivity verified"
echo "=== Control Plane Health ==="
# Note: on managed control planes (EKS, GKE, AKS) these pods are not visible; skip or adapt this check there.
for component in kube-apiserver kube-controller-manager kube-scheduler etcd; do
echo "Checking $component..."
READY=$(kubectl get pods -n kube-system -l component=$component -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
if [[ $READY != *"True"* ]]; then
echo "❌ $component not ready"
exit 1
fi
done
echo "✅ Control plane health verified"
echo "=== ArgoCD Status ==="
ARGO_READY=$(kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
if [[ $ARGO_READY != *"True"* ]]; then
echo "❌ ArgoCD not ready"
exit 1
fi
echo "✅ ArgoCD health verified"
echo "=== Infrastructure Components ==="
for namespace in cert-manager ingress-nginx monitoring vault logging; do
echo "Checking $namespace namespace..."
kubectl get pods -n $namespace
if [ $? -ne 0 ]; then
echo "⚠️ Issues found in $namespace namespace"
fi
done
echo "=== Running Synthetic Tests ==="
# Create test deployment
kubectl create namespace test
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment
namespace: test
spec:
replicas: 1
selector:
matchLabels:
app: test
template:
metadata:
labels:
app: test
spec:
containers:
- name: nginx
image: nginx:stable
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: test-service
namespace: test
spec:
selector:
app: test
ports:
- port: 80
targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
namespace: test
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- test.example.com
secretName: test-tls
rules:
- host: test.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: test-service
port:
number: 80
EOF
echo "Waiting for test deployment to be ready..."
kubectl wait --for=condition=available deployment/test-deployment -n test --timeout=120s
echo "=== Checking Monitoring ==="
PROMETHEUS_ENDPOINT=$(kubectl get svc -n monitoring prometheus-k8s -o jsonpath='{.spec.clusterIP}')
curl -s "http://$PROMETHEUS_ENDPOINT:9090/api/v1/targets" | grep "up=\"1\""
if [ $? -ne 0 ]; then
echo "⚠️ Some monitoring targets may be down"
else
echo "✅ Monitoring targets up"
fi
echo "=== Validation Complete ==="
echo "Clean up test namespace? (y/n)"
read cleanup
if [ "$cleanup" == "y" ]; then
kubectl delete namespace test
fi
I also perform a security assessment:
# Run kube-bench for CIS benchmark validation
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
name: kube-bench
namespace: default
spec:
template:
spec:
hostPID: true
containers:
- name: kube-bench
image: aquasec/kube-bench:latest
command: ["kube-bench"]
volumeMounts:
- name: var-lib-kubelet
mountPath: /var/lib/kubelet
- name: etc-systemd
mountPath: /etc/systemd
- name: etc-kubernetes
mountPath: /etc/kubernetes
restartPolicy: Never
volumes:
- name: var-lib-kubelet
hostPath:
path: /var/lib/kubelet
- name: etc-systemd
hostPath:
path: /etc/systemd
- name: etc-kubernetes
hostPath:
path: /etc/kubernetes
EOF
# Wait for job to complete
kubectl wait --for=condition=complete job/kube-bench --timeout=300s
# Get results
kubectl logs job/kube-bench
Estimated time for this phase: 20-30 minutes.
To accelerate future deployments, I maintain a GitHub repository with all the necessary scripts and templates:
https://github.com/jagadesh/k8s-enterprise-setup
This repository includes Terraform modules for AWS, Azure, GCP, and on-premises environments, GitOps configuration for ArgoCD and Flux, manifests for the core platform components, validation suites, and supporting documentation. It follows a structure that makes it easy to adapt to different environments:
k8s-enterprise-setup/
├── terraform/
│ ├── aws/
│ ├── azure/
│ ├── gcp/
│ └── on-premise/
├── gitops/
│ ├── argocd/
│ └── flux/
├── components/
│ ├── ingress/
│ ├── monitoring/
│ ├── logging/
│ └── security/
├── validation/
│ ├── health-checks/
│ ├── performance/
│ └── security/
└── docs/
├── architecture/
├── runbooks/
└── handover/
Through dozens of implementations, I’ve identified common pitfalls that delay production readiness:
Problem: IP range conflicts, service mesh issues, and cross-namespace communication problems.
Solution: Use a comprehensive IP address management (IPAM) strategy:
# network-planning.yaml
ClusterCIDR: 10.200.0.0/16 # Pods
ServiceCIDR: 10.201.0.0/16 # Services
# Node subnets by zone
Zone-A: 10.0.0.0/24
Zone-B: 10.0.1.0/24
Zone-C: 10.0.2.0/24
# Ensure non-overlapping ranges with other networks
Corporate-DataCenter: 10.50.0.0/16
Existing-VPCs: 10.100.0.0/16, 172.16.0.0/16
Problem: Under-provisioned nodes leading to resource contention or over-provisioned resources increasing costs.
Solution: Implement proper resource quotas from day one:
# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
namespace: production
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
pods: "20"
Coupled with default resource requests/limits:
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default:
memory: 512Mi
cpu: 500m
defaultRequest:
memory: 256Mi
cpu: 200m
type: Container
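Once both objects are applied, a quick check confirms the quota and the default limits are active in the namespace:
kubectl describe resourcequota compute-resources -n production
kubectl describe limitrange default-limits -n production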
Problem: Focus on initial deployment without planning for ongoing maintenance.
Solution: Implement automation for common operational tasks:
# maintenance/node-drainer.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: node-drainer
namespace: kube-system
spec:
schedule: "0 2 * * 0" # 2 AM every Sunday
jobTemplate:
spec:
template:
spec:
serviceAccountName: node-drainer
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
for node in $(kubectl get nodes -l maintenance=weekly -o name); do
echo "Draining $node"
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
sleep 300
kubectl uncordon $node
done
restartPolicy: OnFailure
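The CronJob references a node-drainer service account that is not defined above; a minimal RBAC sketch for it might look like the following (verify the verbs against the kubectl version baked into the image):
# maintenance/node-drainer-rbac.yaml (illustrative)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-drainer
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-drainer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "patch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["apps"]
  resources: ["daemonsets", "statefulsets", "replicasets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-drainer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-drainer
subjects:
- kind: ServiceAccount
  name: node-drainer
  namespace: kube-system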
Problem: Implementing security controls after workloads are deployed, leading to resistance and compliance issues.
Solution: Deploy security controls from day one:
# security/pod-security-standards.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
name: prevent-privileged-containers
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
excludedNamespaces: ["kube-system"]
parameters: {}
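Note that K8sPSPPrivilegedContainer is not a built-in kind: it comes from a ConstraintTemplate in the Gatekeeper policy library, which has to be deployed alongside Gatekeeper itself. A kustomization entry along these lines pulls it in (the path reflects the gatekeeper-library layout at the time of writing and should be verified):
# security/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://raw.githubusercontent.com/open-policy-agent/gatekeeper-library/master/library/pod-security-policy/privileged-containers/template.yaml
- pod-security-standards.yaml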
Problem: No automated backup solution, leading to potential data loss.
Solution: Implement Velero for Kubernetes-native backup from day one:
# backup/velero-install.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: velero
resources:
- namespace.yaml
- https://github.com/vmware-tanzu/velero/releases/download/v1.10.0/velero-v1.10.0-linux-amd64.tar.gz
patchesStrategicMerge:
- velero-schedule.yaml
# backup/velero-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 1 * * *"
template:
includedNamespaces:
- default
- production
- monitoring
excludedResources:
- pods
- events
includedResources:
- "*"
ttl: 720h # 30 days
storageLocation: default
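Backups only prove their worth when restores are rehearsed. With the Velero CLI installed, a periodic restore drill is a couple of commands (the backup name below is illustrative; scheduled backups are named after the schedule plus a timestamp):
# List backups produced by the daily-backup schedule
velero backup get
# Restore a chosen backup into the cluster
velero restore create --from-backup daily-backup-20240101010000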
Setting up an enterprise-grade Kubernetes environment doesn’t need to be a months-long journey. With proper planning, automation, and a systematic approach, you can deploy a production-ready platform in a single day.
The key elements of success are thorough preparation, infrastructure as code, a GitOps-driven configuration pipeline, a carefully chosen set of core platform components, and rigorous validation before anything goes live.
This approach has consistently delivered reliable, secure, and scalable Kubernetes environments for organizations ranging from startups to large enterprises. By following these guidelines, you can significantly accelerate your journey to production while establishing a solid foundation for future growth.
Remember that a successful Kubernetes implementation isn’t just about the initial deployment—it’s about building a platform that enables your organization to deploy and manage applications confidently, securely, and efficiently over the long term.
Looking for assistance with your Kubernetes implementation? Feel free to reach out to discuss your specific requirements and challenges. I offer consulting services tailored to your organization’s needs, from initial planning through deployment and ongoing operations.