Jagadesh - Kubernetes Expert

My Kubernetes Operator Development Playbook

Introduction: When Custom Operators Make Sense

After working with Kubernetes across multiple distributions and enterprises, I’ve found that the true power of Kubernetes often lies not in its out-of-the-box capabilities, but in its extensibility. Custom operators—Kubernetes-native applications that extend the platform’s functionality—have been among the most powerful tools in my arsenal for solving complex operational challenges.

However, building operators is not a trivial undertaking. They’re effectively distributed systems components that need to be resilient, performant, and secure. Over the years, I’ve developed a systematic approach to operator development that balances technical elegance with practical business value.

In this article, I’ll share my operator development playbook, refined through the creation of over a dozen production operators across various industries. Whether you’re considering your first operator or looking to improve your existing development process, these patterns and practices should help you build more effective, maintainable Kubernetes extensions.

When to Build an Operator (and When Not To)

Before diving into implementation, it’s crucial to determine whether an operator is the right solution. I evaluate potential operator projects using the following criteria:

Good Candidates for Operators

  1. Stateful applications with complex lifecycle management:
    • Databases with backup, restore, and scaling operations
    • Message queues with partitioning and rebalancing needs
    • Distributed systems with peer discovery requirements
  2. Cross-cutting operational concerns:
    • Multi-cluster configuration synchronization
    • Network policy management
    • Certificate management and rotation
  3. Domain-specific operational patterns:
    • Industry-specific compliance automation
    • Custom deployment strategies
    • Specialized health checks and remediation

Poor Candidates for Operators

  1. Simple stateless applications:
    • Better served by Deployments/StatefulSets with appropriate probes
  2. One-off automation tasks:
    • Better implemented as Jobs or CronJobs
  3. UI-heavy workflows:
    • Often better as separate applications with Kubernetes API integration

Decision Framework

I use a simple scoring matrix to evaluate operator candidates:

Criterion                                Weight   Score (1-5)
Repetitive manual operations             3        ?
Complex lifecycle management             3        ?
Well-defined operational model           2        ?
Clear domain boundaries                  2        ?
Need for Kubernetes-native integration   2        ?
Ongoing development resources            1        ?

Each score is multiplied by its weight and the products are summed. Projects with a weighted total above 40 (out of a possible 65) typically make good operator candidates.
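
As a rough illustration, the matrix can be turned into a few lines of Go. The weights come from the table above; the scores in exampleCandidate are made-up values for a hypothetical project, and the function names are mine rather than part of any library.

```go
package main

import "fmt"

// criterion pairs a weight from the scoring matrix with a 1-5 score.
type criterion struct {
	name   string
	weight int
	score  int
}

// weightedTotal returns the sum of weight*score and the maximum possible total.
func weightedTotal(cs []criterion) (total, maxTotal int) {
	for _, c := range cs {
		total += c.weight * c.score
		maxTotal += c.weight * 5
	}
	return total, maxTotal
}

// exampleCandidate holds illustrative scores for a hypothetical project.
func exampleCandidate() []criterion {
	return []criterion{
		{"Repetitive manual operations", 3, 4},
		{"Complex lifecycle management", 3, 5},
		{"Well-defined operational model", 2, 3},
		{"Clear domain boundaries", 2, 4},
		{"Need for Kubernetes-native integration", 2, 4},
		{"Ongoing development resources", 1, 2},
	}
}

func main() {
	total, maxTotal := weightedTotal(exampleCandidate())
	fmt.Printf("score %d/%d, good candidate: %t\n", total, maxTotal, total > 40)
}
```

With these sample scores the weighted total is 51 of 65, which clears the threshold of 40.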

Planning Phase: Problem Definition and Scope

Once I’ve decided an operator is the right approach, I begin with a thorough planning phase:

Define the Domain Model

First, I map out the domain model by asking:

  1. What are the core resources being managed?
    • Define the “nouns” in your domain
    • Identify relationships between resources
  2. What operations need to be performed on these resources?
    • Define the “verbs” in your domain
    • Map out typical operational sequences
  3. What constitutes “healthy” or “unhealthy” states?
    • Define success criteria
    • Identify failure modes and recovery strategies

For example, when building a database operator, I might define:

Resources:
    • DatabaseCluster (top-level resource)
    • DatabaseInstance (individual database nodes)
    • BackupSchedule
    • BackupSnapshot

Operations:
    • Provisioning
    • Scaling
    • Backup/Restore
    • Version Upgrades
    • Failover

Health Criteria:
    • Primary node available
    • Replication functioning
    • Backups completing successfully
    • Performance within thresholds

Define Custom Resources

With the domain model in place, I design the Custom Resource Definitions (CRDs) that will represent these concepts in Kubernetes:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseclusters.database.example.com
spec:
  group: database.example.com
  names:
    kind: DatabaseCluster
    listKind: DatabaseClusterList
    plural: databaseclusters
    singular: databasecluster
    shortNames:
    - dbc
  scope: Namespaced
  versions:
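  - name: v1alpha1
    served: true
    storage: true
    # Enable the status subresource so controllers can call Status().Update()
    subresources:
      status: {}
    schema:
      openAPIV3Schema:
        type: object
        properties:
          # Status is written by the controller; tighten this schema as the API matures
          status:
            type: object
            x-kubernetes-preserve-unknown-fields: true
          spec: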
            type: object
            required: ["version", "replicas"]
            properties:
              version:
                type: string
                description: "Database engine version"
              replicas:
                type: integer
                minimum: 1
                description: "Number of database instances"
              storage:
                type: object
                properties:
                  size:
                    type: string
                    pattern: "^[0-9]+(Gi|Ti)$"
                  storageClass:
                    type: string
              backup:
                type: object
                properties:
                  schedule:
                    type: string
                    # Five cron fields; each may use digits, *, /, commas, and ranges (e.g. "0 2 * * *")
                    pattern: "^(@(yearly|monthly|weekly|daily|hourly)|([0-9*/,-]+ ){4}[0-9*/,-]+)$"
                  retention:
                    type: string
                    pattern: "^[0-9]+(d|w|m|y)$"
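
On the Go side, the schema above maps onto a small set of spec types. The following is a minimal sketch, not generated code: Version and Replicas match the field names used elsewhere in this article, while StorageSpec, BackupSpec, and validateSpec are names I am assuming for illustration. The validation function mirrors the CRD's size and retention patterns so the same rules can be checked in unit tests without an API server.

```go
package main

import (
	"fmt"
	"regexp"
)

// StorageSpec and BackupSpec are assumed Go names for the nested CRD objects.
type StorageSpec struct {
	Size         string `json:"size,omitempty"`
	StorageClass string `json:"storageClass,omitempty"`
}

type BackupSpec struct {
	Schedule  string `json:"schedule,omitempty"`
	Retention string `json:"retention,omitempty"`
}

type DatabaseClusterSpec struct {
	Version  string       `json:"version"`
	Replicas int32        `json:"replicas"`
	Storage  *StorageSpec `json:"storage,omitempty"`
	Backup   *BackupSpec  `json:"backup,omitempty"`
}

// Same patterns as the CRD schema.
var (
	sizePattern      = regexp.MustCompile(`^[0-9]+(Gi|Ti)$`)
	retentionPattern = regexp.MustCompile(`^[0-9]+(d|w|m|y)$`)
)

// validateSpec returns the first rule the spec violates, or nil.
func validateSpec(s DatabaseClusterSpec) error {
	if s.Version == "" {
		return fmt.Errorf("spec.version is required")
	}
	if s.Replicas < 1 {
		return fmt.Errorf("spec.replicas must be at least 1, got %d", s.Replicas)
	}
	if s.Storage != nil && s.Storage.Size != "" && !sizePattern.MatchString(s.Storage.Size) {
		return fmt.Errorf("spec.storage.size %q must look like 100Gi or 1Ti", s.Storage.Size)
	}
	if s.Backup != nil && s.Backup.Retention != "" && !retentionPattern.MatchString(s.Backup.Retention) {
		return fmt.Errorf("spec.backup.retention %q must look like 30d or 4w", s.Backup.Retention)
	}
	return nil
}

func main() {
	good := DatabaseClusterSpec{Version: "13.3", Replicas: 3, Storage: &StorageSpec{Size: "100Gi"}}
	bad := DatabaseClusterSpec{Version: "13.3", Replicas: 3, Storage: &StorageSpec{Size: "100GB"}}
	fmt.Println("good:", validateSpec(good))
	fmt.Println("bad: ", validateSpec(bad))
}
```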

Scope Definition

I carefully define the scope of the operator’s responsibilities:

  1. What the operator WILL do:
    • Clear, specific operational tasks
    • Well-defined failure handling
  2. What the operator WON’T do:
    • Explicit exclusions to prevent scope creep
    • Integration boundaries with other systems
  3. User personas and stories:
    • Who will use this operator?
    • What problems does it solve for them?

This scope definition becomes the foundation for both implementation and testing.

Design Phase: Architecture Decisions

With a clear problem definition, I move to architectural design:

Controller Architecture

I consider several controller architectures based on complexity:

  1. Single controller: For simpler operators with one resource type
  2. Multi-controller: For complex operators managing multiple resources
  3. Hierarchical controllers: For operators with parent-child resource relationships

For example, in a database operator, I might use a hierarchical approach:

    • Primary controller for DatabaseCluster resources
    • Secondary controllers for BackupSchedule and BackupSnapshot resources
    • Each with well-defined responsibilities and boundaries

State Management

One of the most critical design decisions is how to manage state:

  1. Kubernetes-native state:
    • Store all state in resource status, annotations, or labels
    • Pros: No external dependencies, fits Kubernetes paradigm
    • Cons: Limited state capacity, eventual consistency challenges
  2. External state store:
    • Use external databases or key-value stores for complex state
    • Pros: Better for complex state management, potentially more consistent
    • Cons: Additional dependency, more complex deployment

In most cases, I prefer Kubernetes-native state management unless the state is truly complex or voluminous.

Reconciliation Strategy

I design the reconciliation loop with careful consideration of:

  1. Reconciliation frequency:
    • Event-driven for responsive operations
    • Periodic for eventual consistency and drift detection
    • Often a combination of both
  2. Idempotency:
    • Ensure operations can be safely repeated
    • Design for at-least-once delivery semantics
  3. Concurrency control:
    • Resource locking or optimistic concurrency
    • Sequential vs. parallel reconciliation
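
To make idempotency and optimistic concurrency concrete, here is a standard-library sketch that imitates how the API server rejects writes carrying a stale resourceVersion. The store, object, and setReplicas names are hypothetical stand-ins; a real controller would go through the Kubernetes client rather than an in-memory struct.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict: object was modified")

// object is a stand-in for a Kubernetes resource with a resourceVersion.
type object struct {
	resourceVersion int
	replicas        int
}

// store imitates an API server that enforces optimistic concurrency.
type store struct {
	mu  sync.Mutex
	obj object
}

func (s *store) get() object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// update succeeds only if the caller read the latest resourceVersion.
func (s *store) update(o object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if o.resourceVersion != s.obj.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.obj = o
	return nil
}

// setReplicas re-reads and retries on conflict. Writing an absolute value
// (not an increment) is what makes repeating the call safe.
func setReplicas(s *store, want, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		cur := s.get()
		if cur.replicas == want {
			return nil // already converged, nothing to write
		}
		cur.replicas = want
		err := s.update(cur)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errConflict)
}

func main() {
	s := &store{obj: object{resourceVersion: 1, replicas: 1}}
	stale := s.get() // a second writer reads before our update lands
	fmt.Println("first call: ", setReplicas(s, 3, 3))
	fmt.Println("second call:", setReplicas(s, 3, 3))
	stale.replicas = 5
	fmt.Println("stale write:", s.update(stale))
}
```

The second call performs no write at all, and the stale writer is rejected instead of silently overwriting newer state.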

Technical Stack Selection

I choose the technical stack based on project requirements:

  1. Framework selection:
    • Operator SDK (Go): For most production operators
    • Kopf (Python): For internal or simpler operators
    • KUDO: For declarative operators defined mostly in YAML rather than code
  2. Dependency management:
    • Minimize external dependencies
    • Clear versioning strategy for Kubernetes APIs

Here’s an example project structure for a Go-based operator:

my-operator/
├── api/
│   └── v1alpha1/
│       ├── databasecluster_types.go
│       └── zz_generated.deepcopy.go
├── controllers/
│   ├── databasecluster_controller.go
│   └── suite_test.go
├── pkg/
│   ├── engine/
│   │   ├── backup.go
│   │   ├── instance.go
│   │   └── monitoring.go
│   └── utils/
│       ├── health.go
│       └── status.go
├── config/
│   ├── crd/
│   ├── rbac/
│   └── manager/
├── Dockerfile
├── go.mod
└── main.go

Development Phase: Implementation Best Practices

With the architecture defined, I move to implementation:

Controller Patterns That Work

I follow these patterns for controller implementation:

1. Clean Reconciliation Loop

Keep the main reconciliation loop clean and focused:

func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("databasecluster", req.NamespacedName)
    
    // 1. Fetch the resource
    var databaseCluster dbv1alpha1.DatabaseCluster
    if err := r.Get(ctx, req.NamespacedName, &databaseCluster); err != nil {
        if client.IgnoreNotFound(err) == nil {
            // Resource deleted - no requeue
            return ctrl.Result{}, nil
        }
        log.Error(err, "Unable to fetch DatabaseCluster")
        return ctrl.Result{}, err
    }
    
    // 2. Initialize or update status if needed
    if r.initializeStatus(ctx, &databaseCluster) {
        return ctrl.Result{Requeue: true}, nil
    }
    
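    // 3. Validation
    if err := r.validateDatabaseCluster(ctx, &databaseCluster); err != nil {
        log.Error(err, "Validation failed")
        r.recordFailedValidationEvent(&databaseCluster, err)
        // Surface the failure as Ready=False, consistent with handleReconcileError
        if statusErr := r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionReady, err.Error()); statusErr != nil {
            return ctrl.Result{}, statusErr
        }
        // An invalid spec will not fix itself; wait for the next spec change instead of retrying
        return ctrl.Result{}, nil
    }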
    
    // 4. Main reconciliation logic - broken into clear steps
    if err := r.reconcileSecret(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "Secret", err)
    }
    
    if err := r.reconcileConfigMap(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "ConfigMap", err)
    }
    
    if err := r.reconcileStatefulSet(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "StatefulSet", err)
    }
    
    if err := r.reconcileService(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "Service", err)
    }
    
    // 5. Status update
    if err := r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionReady, ""); err != nil {
        return ctrl.Result{}, err
    }
    
    // 6. Schedule next reconciliation
    return ctrl.Result{RequeueAfter: r.reconcilePeriod}, nil
}

2. Finite State Machine

For complex operators, I implement a clear state machine:

func (r *DatabaseClusterReconciler) reconcileStatefulSet(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
    log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))
    
    // State machine logic
    switch dbc.Status.Phase {
    case dbv1alpha1.PhaseNone:
        log.Info("Initializing database cluster")
        // Update phase; the next reconcile pass picks up from the new phase
        dbc.Status.Phase = dbv1alpha1.PhaseInitializing
        if err := r.Status().Update(ctx, dbc); err != nil {
            return err
        }
        return nil
    
    case dbv1alpha1.PhaseInitializing:
        // Create initial statefulset
        if err := r.createInitialStatefulSet(ctx, dbc); err != nil {
            return err
        }
        dbc.Status.Phase = dbv1alpha1.PhaseBootstrapping
        if err := r.Status().Update(ctx, dbc); err != nil {
            return err
        }
        return nil
    
    case dbv1alpha1.PhaseBootstrapping:
        // Implement bootstrap logic
        if ready, err := r.checkBootstrapStatus(ctx, dbc); err != nil {
            return err
        } else if ready {
            dbc.Status.Phase = dbv1alpha1.PhaseRunning
            if err := r.Status().Update(ctx, dbc); err != nil {
                return err
            }
        }
        return nil
    
    case dbv1alpha1.PhaseRunning:
        // Normal operation - check for updates/changes
        return r.reconcileRunningStatefulSet(ctx, dbc)
    
    case dbv1alpha1.PhaseUpgrading:
        // Handle upgrade process
        return r.handleUpgradeProcess(ctx, dbc)
    
    default:
        err := fmt.Errorf("unknown phase: %s", dbc.Status.Phase)
        log.Error(err, "Unexpected phase")
        return err
    }
}
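
One way to keep a state machine like this honest is an explicit transition table that can be unit tested on its own. The sketch below uses only the standard library. The first three transitions follow the switch statement above; the Running to Upgrading and Upgrading to Running edges, the string values of the phases, and the transition function are my assumptions for illustration.

```go
package main

import "fmt"

type Phase string

// Phase string values are assumed; only the names appear in the article.
const (
	PhaseNone          Phase = ""
	PhaseInitializing  Phase = "Initializing"
	PhaseBootstrapping Phase = "Bootstrapping"
	PhaseRunning       Phase = "Running"
	PhaseUpgrading     Phase = "Upgrading"
)

// allowed lists the legal next phases for each phase.
var allowed = map[Phase][]Phase{
	PhaseNone:          {PhaseInitializing},
	PhaseInitializing:  {PhaseBootstrapping},
	PhaseBootstrapping: {PhaseRunning},
	PhaseRunning:       {PhaseUpgrading}, // assumed edge
	PhaseUpgrading:     {PhaseRunning},   // assumed edge
}

// transition returns the new phase, or the unchanged phase and an error
// when the move is not in the table.
func transition(from, to Phase) (Phase, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal phase transition %q -> %q", from, to)
}

func main() {
	phase := PhaseNone
	for _, next := range []Phase{PhaseInitializing, PhaseBootstrapping, PhaseRunning} {
		var err error
		if phase, err = transition(phase, next); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("now in phase:", phase)
	}
	// Skipping ahead is rejected and leaves the phase untouched.
	_, err := transition(PhaseInitializing, PhaseRunning)
	fmt.Println("skip attempt:", err)
}
```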

3. Status Management

I implement comprehensive status management:

func (r *DatabaseClusterReconciler) updateStatusCondition(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster, 
    conditionType dbv1alpha1.ConditionType, message string) error {
    
    // Find existing condition
    var condition *dbv1alpha1.DatabaseClusterCondition
    for i := range dbc.Status.Conditions {
        if dbc.Status.Conditions[i].Type == conditionType {
            condition = &dbc.Status.Conditions[i]
            break
        }
    }
    
    // If not found, create it
    if condition == nil {
        dbc.Status.Conditions = append(dbc.Status.Conditions, dbv1alpha1.DatabaseClusterCondition{
            Type: conditionType,
        })
        condition = &dbc.Status.Conditions[len(dbc.Status.Conditions)-1]
    }
    
    // Update condition; only bump LastTransitionTime when the status actually flips
    status := metav1.ConditionTrue
    if message != "" {
        status = metav1.ConditionFalse
    }
    if condition.Status != status {
        condition.Status = status
        condition.LastTransitionTime = metav1.Now()
    }
    condition.Message = message
    
    // Update the resource status
    return r.Status().Update(ctx, dbc)
}

Reconciliation Loop Best Practices

I follow these best practices for stable, maintainable reconciliation loops:

1. Idempotency

Every operation must be idempotent:

func (r *DatabaseClusterReconciler) reconcileConfigMap(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
    // Define the desired ConfigMap
    desired := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      getConfigMapName(dbc),
            Namespace: dbc.Namespace,
            Labels:    getLabels(dbc),
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(dbc, schema.GroupVersionKind{
                    Group:   dbv1alpha1.GroupVersion.Group,
                    Version: dbv1alpha1.GroupVersion.Version,
                    Kind:    "DatabaseCluster",
                }),
            },
        },
        Data: map[string]string{
            "db.conf": generateDBConfig(dbc),
        },
    }
    
    // Check if it already exists
    existing := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
    
    if err != nil && errors.IsNotFound(err) {
        // Create it if it doesn't exist
        if err = r.Create(ctx, desired); err != nil {
            return fmt.Errorf("failed to create ConfigMap: %w", err)
        }
        return nil
    } else if err != nil {
        return fmt.Errorf("failed to get ConfigMap: %w", err)
    }
    
    // Compare and update if needed
    if !reflect.DeepEqual(existing.Data, desired.Data) || !reflect.DeepEqual(existing.Labels, desired.Labels) {
        existing.Data = desired.Data
        existing.Labels = desired.Labels
        if err = r.Update(ctx, existing); err != nil {
            return fmt.Errorf("failed to update ConfigMap: %w", err)
        }
    }
    
    return nil
}
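
A quick way to check idempotency is to apply the same desired state several times and count the writes. The following standard-library sketch does this against an in-memory stand-in for the cluster; fakeCluster, configMap, and ensure are hypothetical names, not controller-runtime types.

```go
package main

import (
	"fmt"
	"reflect"
)

// configMap is a simplified stand-in for corev1.ConfigMap.
type configMap struct {
	name string
	data map[string]string
}

// fakeCluster stores objects by name and counts write operations.
type fakeCluster struct {
	objects map[string]configMap
	writes  int
}

// ensure creates or updates only when the observed state differs from the
// desired state, and reports which action it took.
func (c *fakeCluster) ensure(desired configMap) string {
	existing, found := c.objects[desired.name]
	switch {
	case !found:
		c.objects[desired.name] = desired
		c.writes++
		return "created"
	case !reflect.DeepEqual(existing.data, desired.data):
		c.objects[desired.name] = desired
		c.writes++
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	cluster := &fakeCluster{objects: map[string]configMap{}}
	v1 := configMap{name: "test-db-config", data: map[string]string{"db.conf": "max_connections=100"}}
	v2 := configMap{name: "test-db-config", data: map[string]string{"db.conf": "max_connections=200"}}

	fmt.Println(cluster.ensure(v1)) // first pass creates
	fmt.Println(cluster.ensure(v1)) // repeat is a no-op
	fmt.Println(cluster.ensure(v2)) // changed spec triggers one update
	fmt.Println("writes:", cluster.writes)
}
```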

2. Error Handling

Implement consistent error handling:

func (r *DatabaseClusterReconciler) handleReconcileError(
    ctx context.Context, 
    dbc *dbv1alpha1.DatabaseCluster, 
    component string, 
    err error,
) (ctrl.Result, error) {
    log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))
    log.Error(err, "Reconciliation failed", "component", component)
    
    // Record event
    r.Recorder.Event(dbc, corev1.EventTypeWarning, 
        "Failed"+component, 
        fmt.Sprintf("Failed to reconcile %s: %v", component, err),
    )
    
    // Update status
    updateErr := r.updateStatusCondition(ctx, dbc, dbv1alpha1.ConditionReady, 
        fmt.Sprintf("Failed to reconcile %s: %v", component, err))
    
    if updateErr != nil {
        log.Error(updateErr, "Failed to update status")
        // If we can't update status, return both errors
        // Wrap the original error with %w so callers can still inspect it with errors.Is/As
        return ctrl.Result{}, fmt.Errorf("%w (status update also failed: %v)", err, updateErr)
    }
    
    // For some errors, we don't want to requeue immediately
    if errors.IsConflict(err) || errors.IsAlreadyExists(err) {
        return ctrl.Result{RequeueAfter: time.Second * 5}, nil
    }
    
    return ctrl.Result{}, err
}
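
The decision at the end of that function can be isolated into a small, testable classifier. This is a sketch under assumptions: the sentinel errors and the outcome type are hypothetical, the 5-second delay for conflicts comes from the code above, and treating an invalid spec as "do not retry" is my own suggestion rather than something the handler above does.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for API error categories.
var (
	errConflict    = errors.New("conflict")
	errInvalidSpec = errors.New("invalid spec")
	errBoom        = errors.New("disk full")
)

// outcome describes what the reconciler should hand back to the work queue.
type outcome struct {
	requeueAfter time.Duration // fixed delay, no error reported
	err          error         // non-nil means "retry with backoff"
}

// classify maps an error to a requeue decision.
func classify(err error) outcome {
	switch {
	case err == nil:
		return outcome{}
	case errors.Is(err, errConflict):
		// Someone else wrote first; try again shortly with fresh state.
		return outcome{requeueAfter: 5 * time.Second}
	case errors.Is(err, errInvalidSpec):
		// Retrying cannot help until the user edits the spec.
		return outcome{}
	default:
		// Unknown failure: surface it so the queue applies backoff.
		return outcome{err: err}
	}
}

func main() {
	wrapped := fmt.Errorf("update ConfigMap: %w", errConflict)
	for _, e := range []error{nil, wrapped, errInvalidSpec, errBoom} {
		o := classify(e)
		fmt.Printf("%v -> requeueAfter=%s err=%v\n", e, o.requeueAfter, o.err)
	}
}
```

Because the classifier uses errors.Is, it still recognizes a conflict after the error has been wrapped with additional context.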

3. Dependency Management

I keep external dependencies behind interfaces and inject the concrete implementation at startup:

// Define an interface for database operations
type DatabaseEngine interface {
    CreateDatabase(ctx context.Context, name string) error
    CreateUser(ctx context.Context, username, password string) error
    GrantPermissions(ctx context.Context, username, database string, permissions []string) error
    // Other methods...
}

// Controller using the interface
type DatabaseClusterReconciler struct {
    client.Client
    Log         logr.Logger
    Scheme      *runtime.Scheme
    Recorder    record.EventRecorder
    DBEngine    DatabaseEngine
}

// Implementation for a specific database
type PostgresEngine struct {
    // Implementation details
}

func (p *PostgresEngine) CreateDatabase(ctx context.Context, name string) error {
    // Postgres-specific implementation
    return nil
}

// In main.go, wire up the correct implementation
reconciler := &controllers.DatabaseClusterReconciler{
    Client:    mgr.GetClient(),
    Log:       ctrl.Log.WithName("controllers").WithName("DatabaseCluster"),
    Scheme:    mgr.GetScheme(),
    Recorder:  mgr.GetEventRecorderFor("databasecluster-controller"),
    DBEngine:  &controllers.PostgresEngine{},
}
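
The payoff of the interface shows up in tests, where a fake engine can replace the real database. Below is a minimal standard-library sketch: the DatabaseEngine interface repeats the three methods shown above, while fakeEngine, provision, and provisionWithFake are hypothetical helpers I introduce for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// DatabaseEngine repeats the interface from the article.
type DatabaseEngine interface {
	CreateDatabase(ctx context.Context, name string) error
	CreateUser(ctx context.Context, username, password string) error
	GrantPermissions(ctx context.Context, username, database string, permissions []string) error
}

// fakeEngine records calls and can be told to fail on one of them.
type fakeEngine struct {
	calls  []string
	failOn string
}

func (f *fakeEngine) record(call string) error {
	f.calls = append(f.calls, call)
	if call == f.failOn {
		return errors.New("injected failure: " + call)
	}
	return nil
}

func (f *fakeEngine) CreateDatabase(_ context.Context, name string) error {
	return f.record("CreateDatabase:" + name)
}

func (f *fakeEngine) CreateUser(_ context.Context, username, _ string) error {
	return f.record("CreateUser:" + username)
}

func (f *fakeEngine) GrantPermissions(_ context.Context, username, database string, _ []string) error {
	return f.record("GrantPermissions:" + username + "@" + database)
}

// Compile-time check that the fake satisfies the interface.
var _ DatabaseEngine = (*fakeEngine)(nil)

// provision is hypothetical controller logic that depends only on the interface.
func provision(ctx context.Context, e DatabaseEngine, db, user, password string) error {
	if err := e.CreateDatabase(ctx, db); err != nil {
		return fmt.Errorf("create database: %w", err)
	}
	if err := e.CreateUser(ctx, user, password); err != nil {
		return fmt.Errorf("create user: %w", err)
	}
	return e.GrantPermissions(ctx, user, db, []string{"ALL"})
}

// provisionWithFake runs provision against a fake that fails on the given call.
func provisionWithFake(failOn string) (*fakeEngine, error) {
	fake := &fakeEngine{failOn: failOn}
	err := provision(context.Background(), fake, "orders", "app", "s3cret")
	return fake, err
}

func main() {
	fake, err := provisionWithFake("")
	fmt.Println(err, fake.calls)
	fake, err = provisionWithFake("CreateUser:app")
	fmt.Println(err, fake.calls)
}
```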

Testing Strategies

I approach operator testing comprehensively:

Unit Testing

I unit test all core logic using controller-runtime's fake client:

func TestReconcileConfigMap(t *testing.T) {
    // Setup
    dbc := &dbv1alpha1.DatabaseCluster{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-db",
            Namespace: "default",
        },
        Spec: dbv1alpha1.DatabaseClusterSpec{
            Version:  "13.3",
            Replicas: 3,
        },
    }
    
    // Create a fake client
    objs := []client.Object{dbc}
    scheme := runtime.NewScheme()
    dbv1alpha1.AddToScheme(scheme)
    corev1.AddToScheme(scheme)
    fakeClient := fake.NewClientBuilder().WithScheme(scheme).WithObjects(objs...).Build()
    
    reconciler := &DatabaseClusterReconciler{
        Client:   fakeClient,
        Scheme:   scheme,
        Recorder: record.NewFakeRecorder(10),
    }
    
    // Execute
    err := reconciler.reconcileConfigMap(context.Background(), dbc)
    
    // Verify
    assert.NoError(t, err)
    
    // Check if ConfigMap was created
    configMap := &corev1.ConfigMap{}
    err = fakeClient.Get(context.Background(), 
        types.NamespacedName{Name: "test-db-config", Namespace: "default"}, 
        configMap)
    
    assert.NoError(t, err)
    assert.Equal(t, "test-db-config", configMap.Name)
    assert.Contains(t, configMap.Data, "db.conf")
}

Integration Testing

I use the Kubernetes controller-runtime envtest package for integration testing:

var (
    cfg       *rest.Config
    k8sClient client.Client
    testEnv   *envtest.Environment
    ctx       context.Context
    cancel    context.CancelFunc
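)

// Polling settings for Eventually; the values are illustrative
const (
    timeout  = 10 * time.Second
    interval = 250 * time.Millisecond
)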

func TestAPIs(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
    ctx, cancel = context.WithCancel(context.TODO())
    
    By("bootstrapping test environment")
    testEnv = &envtest.Environment{
        CRDDirectoryPaths:     []string{filepath.Join("..", "config", "crd", "bases")},
        ErrorIfCRDPathMissing: true,
    }
    
    var err error
    cfg, err = testEnv.Start()
    Expect(err).NotTo(HaveOccurred())
    Expect(cfg).NotTo(BeNil())
    
    err = dbv1alpha1.AddToScheme(scheme.Scheme)
    Expect(err).NotTo(HaveOccurred())
    
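    k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())
    Expect(k8sClient).NotTo(BeNil())
    
    // Start the controller under test; without a running manager nothing reconciles.
    // Assumes the reconciler exposes the usual kubebuilder SetupWithManager method.
    mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())
    err = (&DatabaseClusterReconciler{
        Client:   mgr.GetClient(),
        Log:      ctrl.Log.WithName("controllers").WithName("DatabaseCluster"),
        Scheme:   mgr.GetScheme(),
        Recorder: mgr.GetEventRecorderFor("databasecluster-controller"),
    }).SetupWithManager(mgr)
    Expect(err).NotTo(HaveOccurred())
    
    go func() {
        defer GinkgoRecover()
        Expect(mgr.Start(ctx)).To(Succeed())
    }()
})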

var _ = AfterSuite(func() {
    cancel()
    By("tearing down the test environment")
    err := testEnv.Stop()
    Expect(err).NotTo(HaveOccurred())
})

var _ = Describe("DatabaseCluster controller", func() {
    Context("When creating a DatabaseCluster", func() {
        It("Should create associated resources", func() {
            dbc := &dbv1alpha1.DatabaseCluster{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "test-db",
                    Namespace: "default",
                },
                Spec: dbv1alpha1.DatabaseClusterSpec{
                    Version:  "13.3",
                    Replicas: 3,
                },
            }
            
            Expect(k8sClient.Create(ctx, dbc)).Should(Succeed())
            
            // Wait for resources to be created
            configMap := &corev1.ConfigMap{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name: "test-db-config", Namespace: "default",
                }, configMap)
                return err == nil
            }, timeout, interval).Should(BeTrue())
            
            // Verify statefulset was created
            sts := &appsv1.StatefulSet{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name: "test-db", Namespace: "default",
                }, sts)
                return err == nil
            }, timeout, interval).Should(BeTrue())
            
            // Compare the value directly; pointer.Int32Ptr is deprecated
            Expect(sts.Spec.Replicas).NotTo(BeNil())
            Expect(*sts.Spec.Replicas).To(Equal(int32(3)))
        })
    })
})

End-to-End Testing

I run end-to-end tests against real Kubernetes clusters:

func TestOperatorEndToEnd(t *testing.T) {
    // Skip if not running in CI environment with K8s access
    if os.Getenv("RUN_E2E_TESTS") != "true" {
        t.Skip("Skipping E2E tests")
    }
    
    // Use the current context in kubeconfig
    config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        t.Fatalf("Error building kubeconfig: %v", err)
    }
    
    // Create a clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        t.Fatalf("Error creating clientset: %v", err)
    }
    
    // Create dynamic client for CRDs
    dynamicClient, err := dynamic.NewForConfig(config)
    if err != nil {
        t.Fatalf("Error creating dynamic client: %v", err)
    }
    
    // Create test namespace
    namespace := &corev1.Namespace{
        ObjectMeta: metav1.ObjectMeta{
            Name: "operator-e2e-test",
        },
    }
    _, err = clientset.CoreV1().Namespaces().Create(context.TODO(), namespace, metav1.CreateOptions{})
    if err != nil && !errors.IsAlreadyExists(err) {
        t.Fatalf("Error creating namespace: %v", err)
    }
    
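    // Register cleanup up front so it runs even when a subtest fails
    t.Cleanup(func() {
        err := clientset.CoreV1().Namespaces().Delete(context.TODO(), namespace.Name, metav1.DeleteOptions{})
        if err != nil && !errors.IsNotFound(err) {
            t.Errorf("Error deleting namespace: %v", err)
        }
    })
    
    // Handle for DatabaseCluster objects, using the group and version from the CRD
    dbClusters := dynamicClient.Resource(schema.GroupVersionResource{
        Group: "database.example.com", Version: "v1alpha1", Resource: "databaseclusters",
    }).Namespace(namespace.Name)
    
    // Run the actual tests...
    t.Run("CreateDatabaseCluster", func(t *testing.T) {
        // Test creating a DatabaseCluster through dbClusters and verify it works
        _ = dbClusters
    })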
}

Chaos Testing

For production-critical operators, I implement chaos testing:

  1. Use tools like Chaos Mesh or Litmus to:
    • Kill operator pods
    • Disconnect from the API server
    • Simulate network partitions
  2. Verify recovery behavior:
    • Does the operator resume reconciliation?
    • Are resources eventually consistent?
    • How does it handle conflicting changes?

Deployment and Monitoring Considerations

Deploying and monitoring operators requires careful planning:

Operator Deployment Strategies

I use a progressive deployment approach:

  1. Development:
    • Controller runs locally against dev cluster
    • Rapid iteration and testing
  2. Staging:
    • Deployed via CI/CD pipeline
    • Mimics production setup
    • Integration and end-to-end testing
  3. Production:
    • Deployment with appropriate replicas
    • RBAC configuration with least privilege
    • Resource limits and requests

Example Helm chart values for production:

# Production values for database-operator
# Run with leader election enabled so only one replica reconciles at a time
replicaCount: 2

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 256Mi

nodeSelector:
  node-role.kubernetes.io/control-plane: ""

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app.kubernetes.io/name
          operator: In
          values:
          - database-operator
      topologyKey: "kubernetes.io/hostname"

rbac:
  create: true
  clusterRole: true

Monitoring and Alerting

I implement comprehensive monitoring for operators:

  1. Prometheus metrics:
    • Reconciliation counts and durations
    • Error counts by type
    • Resource counts by state
// Define metrics
// Register with controller-runtime's registry (sigs.k8s.io/controller-runtime/pkg/metrics)
// so they are served on the manager's /metrics endpoint
var (
    reconcileTotal = promauto.With(metrics.Registry).NewCounterVec(
        prometheus.CounterOpts{
            Name: "controller_reconcile_total",
            Help: "The total number of reconciliations",
        },
        []string{"controller", "result"},
    )
    
    reconcileDuration = promauto.With(metrics.Registry).NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "controller_reconcile_duration_seconds",
            Help:    "The duration of reconciliations",
            Buckets: prometheus.DefBuckets,
        },
        []string{"controller"},
    )
    
    resourcesTotal = promauto.With(metrics.Registry).NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "controller_resources_total",
            Help: "The total number of resources by state",
        },
        []string{"controller", "kind", "state"},
    )
)

// Instrument the reconciliation loop
func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    controllerName := "databasecluster"
    timer := prometheus.NewTimer(reconcileDuration.WithLabelValues(controllerName))
    defer timer.ObserveDuration()
    
    // Reconciliation logic lives in a helper (here called reconcile) so every outcome is counted
    result, err := r.reconcile(ctx, req)
    if err != nil {
        reconcileTotal.WithLabelValues(controllerName, "error").Inc()
        return result, err
    }
    
    reconcileTotal.WithLabelValues(controllerName, "success").Inc()
    return result, nil
}
  2. Log levels and formats:
    • Structured logging
    • Dynamic log levels
    • Correlation IDs for tracking reconciliation
// Set up structured logging
func setupLogger() logr.Logger {
    var opts zap.Options
    opts.Development = false
    opts.EncoderConfigOptions = append(
        opts.EncoderConfigOptions,
        func(c *zapcore.EncoderConfig) {
            c.TimeKey = "timestamp"
            c.EncodeTime = zapcore.ISO8601TimeEncoder
        },
    )
    
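    // zap here is sigs.k8s.io/controller-runtime/pkg/log/zap; zap.New already
    // returns a logr.Logger, so no zapr wrapper is needed
    return zap.New(zap.UseFlagOptions(&opts))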
}
  3. Alerting rules:
    • Error rate thresholds
    • Reconciliation stalls
    • Resource drift detection
groups:
- name: DatabaseOperatorAlerts
  rules:
  - alert: DatabaseOperatorHighErrorRate
    expr: rate(controller_reconcile_total{controller="databasecluster",result="error"}[5m]) > 0.1
    for: 10m
    annotations:
      summary: "Database operator high error rate"
      description: "Database operator has a high error rate for the last 10 minutes"
  
  - alert: DatabaseOperatorReconciliationStalled
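    # Assumes the operator exports controller_resource_last_reconcile_time as a
    # per-resource gauge holding the Unix timestamp of the last successful reconcile
    expr: time() - max by (namespace, name) (controller_resource_last_reconcile_time{controller="databasecluster"}) > 3600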
    for: 5m
    annotations:
      summary: "Database reconciliation stalled"
      description: "Some database resources haven't been reconciled in over an hour"

Case Study: The Cross-Cluster Configuration Operator I Built

To illustrate these principles, I’ll share a real-world operator I developed for a financial services client:

Problem Statement

The organization needed to enforce consistent configuration policies across 40+ Kubernetes clusters spanning on-premise and multiple cloud providers. Manual configuration was error-prone and time-consuming.

Solution: ConfigSync Operator

I designed and built a ConfigSync operator that:

  1. Managed a central “source of truth” for configurations:
    • Security policies
    • Resource quotas
    • Network policies
    • Admission control
  2. Supported multi-level inheritance:
    • Global baseline
    • Environment-specific (prod, dev, test)
    • Region-specific
    • Cluster-specific
  3. Provided policy validation and enforcement:
    • Pre-deployment validation
    • Drift detection
    • Automatic remediation
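
The inheritance levels can be pictured as an ordered merge in which more specific layers override more general ones. The sketch below is a simplified illustration using flat key/value settings; the layer type, the resolve function, the sample keys, and the exact precedence order (global, then environment, then region, then cluster) are assumptions, not the client's actual implementation.

```go
package main

import "fmt"

// layer is one level of the inheritance chain.
type layer struct {
	name   string
	values map[string]string
}

// resolve merges layers in order; later layers override earlier ones.
// It also records which layer supplied each final value, which is handy
// for audit output.
func resolve(layers ...layer) (merged, origin map[string]string) {
	merged = map[string]string{}
	origin = map[string]string{}
	for _, l := range layers {
		for k, v := range l.values {
			merged[k] = v
			origin[k] = l.name
		}
	}
	return merged, origin
}

func main() {
	merged, origin := resolve(
		layer{"global", map[string]string{"quota.cpu": "10", "netpol.default": "deny"}},
		layer{"environment", map[string]string{"quota.cpu": "40"}},
		layer{"region", map[string]string{"registry.mirror": "us-east"}},
		layer{"cluster", map[string]string{"quota.cpu": "64"}},
	)
	for k, v := range merged {
		fmt.Printf("%s=%s (from %s)\n", k, v, origin[k])
	}
}
```

Tracking the origin of each value alongside the merged result makes it easier to explain why a given cluster ended up with a particular setting.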

Implementation Details

The operator consisted of:

  1. Central ConfigStore CRD:

    apiVersion: config.example.com/v1alpha1
    kind: ConfigStore
    metadata:
      name: global-configs
    spec:
      source:
        git:
          repository: https://github.com/company/k8s-configs
          branch: main
          path: /global
      schedule: "*/30 * * * *"
      validation:
        mode: Strict  # Strict, Warn, or None
  2. ConfigSync CRD:

    apiVersion: config.example.com/v1alpha1
    kind: ConfigSync
    metadata:
      name: cluster-config-sync
    spec:
      sources:
      - name: global
        configStore: global-configs
      - name: environment
        configStore: prod-configs
      - name: regional
        configStore: us-east-configs
      targets:
      - kind: NetworkPolicy
        apiGroup: networking.k8s.io
        namespaces:
        - "*"
      - kind: ResourceQuota
        apiGroup: ""
        namespaces:
        - default
        - kube-system
      remediation:
        mode: Apply  # Apply, Report, or None
  3. Controller Implementation:

    • Periodically fetched configurations from Git
    • Dynamically built templates with context-specific values
    • Validated and applied configurations to target clusters
    • Reported compliance status and drift
  4. GitOps Integration:

    • Config changes triggered CI validation
    • Deployment approval workflow
    • Automatic rollback on validation failures
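
Drift detection and remediation can be sketched as a fingerprint comparison followed by a mode switch. This is an illustrative standard-library version: fingerprint, checkDrift, and driftReport are hypothetical names, and only the remediation modes (Apply, Report, None) come from the ConfigSync example above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes a key/value set in sorted key order so the result
// does not depend on map iteration order.
func fingerprint(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		// Length prefixes keep "a=b" + "c" distinct from "a" + "b=c".
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// driftReport records what was found and what the operator would do.
type driftReport struct {
	name    string
	drifted bool
	action  string
}

// checkDrift compares desired and actual state and picks an action
// according to the remediation mode.
func checkDrift(name string, desired, actual map[string]string, mode string) driftReport {
	if fingerprint(desired) == fingerprint(actual) {
		return driftReport{name: name, drifted: false, action: "none"}
	}
	switch mode {
	case "Apply":
		return driftReport{name: name, drifted: true, action: "reapply"}
	case "Report":
		return driftReport{name: name, drifted: true, action: "report"}
	default:
		return driftReport{name: name, drifted: true, action: "ignore"}
	}
}

func main() {
	desired := map[string]string{"policy": "deny-all", "quota.cpu": "40"}
	actual := map[string]string{"policy": "allow-all", "quota.cpu": "40"}
	fmt.Printf("%+v\n", checkDrift("default/netpol", desired, desired, "Apply"))
	fmt.Printf("%+v\n", checkDrift("default/netpol", desired, actual, "Apply"))
	fmt.Printf("%+v\n", checkDrift("default/netpol", desired, actual, "Report"))
}
```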

Results

The operator transformed the organization’s configuration management:

  1. Reduced configuration errors by 94%
  2. Decreased time-to-deploy for policy changes from days to minutes
  3. Improved security posture with consistent enforcement
  4. Simplified audit compliance with comprehensive reporting

Conclusion

Building Kubernetes operators is both an art and a science. The most successful operators strike a balance between solving real operational challenges and maintaining simplicity and reliability.

As you approach your own operator development, remember:

  1. Start with a clear problem definition:
    • Is an operator the right solution?
    • What specific operational challenges will it solve?
  2. Design for the Kubernetes ecosystem:
    • Follow Kubernetes patterns and principles
    • Leverage the declarative model
    • Build for eventual consistency
  3. Implement with a focus on reliability:
    • Comprehensive testing
    • Proper error handling
    • Observability from day one
  4. Evolve gradually:
    • Start with a minimal viable operator
    • Add capabilities based on actual needs
    • Maintain backward compatibility

By following the approaches outlined in this playbook, you’ll be well-equipped to build operators that not only solve real problems but do so reliably and maintainably.


For further discussion or questions about operator development, feel free to reach out. I’m always interested in hearing about interesting operator use cases and implementation challenges.