Jagadesh - Kubernetes Expert
My Kubernetes Operator Development Playbook
Introduction: When Custom Operators Make Sense
After working with Kubernetes across multiple distributions and enterprises, I’ve found that the true power of Kubernetes often lies not in its out-of-the-box capabilities, but in its extensibility. Custom operators—Kubernetes-native applications that extend the platform’s functionality—have been among the most powerful tools in my arsenal for solving complex operational challenges.
As a Go developer with extensive Kubernetes experience, I’ve built operators that manage everything from databases to multi-cluster configurations. Go is the natural choice for operator development, being the language of Kubernetes itself, offering excellent concurrency support, strong typing, and a rich ecosystem of Kubernetes libraries.
However, building operators is not a trivial undertaking. They’re effectively distributed systems components that need to be resilient, performant, and secure. Over the years, I’ve developed a systematic approach to operator development that balances technical elegance with practical business value, leveraging Go’s strengths throughout.
In this article, I’ll share my operator development playbook, refined through the creation of over a dozen production operators across various industries. Whether you’re a Go developer looking to extend Kubernetes or a platform engineer considering your first operator, these patterns and practices should help you build more effective, maintainable Kubernetes extensions.
When to Build an Operator (and When Not To)
Before diving into implementation, it’s crucial to determine whether an operator is the right solution. I evaluate potential operator projects using the following criteria:
Good Candidates for Operators
- Stateful applications with complex lifecycle management:
  - Databases with backup, restore, and scaling operations
  - Message queues with partitioning and rebalancing needs
  - Distributed systems with peer discovery requirements
- Cross-cutting operational concerns:
  - Multi-cluster configuration synchronization
  - Network policy management
  - Certificate management and rotation
- Domain-specific operational patterns:
  - Industry-specific compliance automation
  - Custom deployment strategies
  - Specialized health checks and remediation
Poor Candidates for Operators
- Simple stateless applications:
  - Better served by Deployments/StatefulSets with appropriate probes
- One-off automation tasks:
  - Better implemented as Jobs or CronJobs
- UI-heavy workflows:
  - Often better as separate applications with Kubernetes API integration
Decision Framework
I use a simple scoring matrix to evaluate operator candidates:
| Criterion | Weight | Score (1-5) |
|---|---|---|
| Repetitive manual operations | 3 | ? |
| Complex lifecycle management | 3 | ? |
| Well-defined operational model | 2 | ? |
| Clear domain boundaries | 2 | ? |
| Need for Kubernetes-native integration | 2 | ? |
| Ongoing development resources | 1 | ? |
I multiply each score by its weight and sum the results. Projects with a weighted total above 40 (out of a possible 65) typically make good operator candidates.
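The arithmetic behind the matrix can be sketched in a few lines of Go. This is only an illustration: the weights come from the table above, while the `Criterion` type, the `WeightedScore` function, and the example scores are made up for the sketch.

```go
package main

import "fmt"

// Criterion pairs a weight from the matrix with a 1-5 score.
type Criterion struct {
	Name   string
	Weight int
	Score  int // 1 (poor fit) to 5 (strong fit)
}

// WeightedScore returns the sum of weight*score and the maximum achievable total.
func WeightedScore(criteria []Criterion) (total, maxTotal int, err error) {
	for _, c := range criteria {
		if c.Score < 1 || c.Score > 5 {
			return 0, 0, fmt.Errorf("criterion %q: score %d is outside 1-5", c.Name, c.Score)
		}
		total += c.Weight * c.Score
		maxTotal += c.Weight * 5
	}
	return total, maxTotal, nil
}

// exampleCriteria uses the weights from the table; the scores are invented.
func exampleCriteria() []Criterion {
	return []Criterion{
		{"Repetitive manual operations", 3, 4},
		{"Complex lifecycle management", 3, 5},
		{"Well-defined operational model", 2, 3},
		{"Clear domain boundaries", 2, 4},
		{"Need for Kubernetes-native integration", 2, 4},
		{"Ongoing development resources", 1, 2},
	}
}

func main() {
	total, maxTotal, err := WeightedScore(exampleCriteria())
	if err != nil {
		panic(err)
	}
	// The threshold of 40 is the one stated in the text.
	fmt.Printf("score %d of %d, good candidate: %v\n", total, maxTotal, total > 40)
}
```

With the weights in the table the maximum is 13 × 5 = 65, which is where the "out of 65" figure comes from.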
Planning Phase: Problem Definition and Scope
Once I’ve decided an operator is the right approach, I begin with a thorough planning phase:
Define the Domain Model
First, I map out the domain model by asking:
- What are the core resources being managed?
  - Define the “nouns” in your domain
  - Identify relationships between resources
- What operations need to be performed on these resources?
  - Define the “verbs” in your domain
  - Map out typical operational sequences
- What constitutes “healthy” or “unhealthy” states?
  - Define success criteria
  - Identify failure modes and recovery strategies
For example, when building a database operator, I might define:
Resources:
- DatabaseCluster (top-level resource)
- DatabaseInstance (individual database nodes)
- BackupSchedule
- BackupSnapshot
Operations:
- Provisioning
- Scaling
- Backup/Restore
- Version Upgrades
- Failover
Health Criteria:
- Primary node available
- Replication functioning
- Backups completing successfully
- Performance within thresholds
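Before writing any CRDs, it can help to pin the domain model down as plain Go types. The sketch below is one possible shape, not the types from a real operator: `Operation`, `ClusterHealth`, and `Problems` are hypothetical names chosen to mirror the verbs and health criteria listed above.

```go
package main

import "fmt"

// Operation names the "verbs" of the domain model.
type Operation string

const (
	OpProvision Operation = "Provisioning"
	OpScale     Operation = "Scaling"
	OpBackup    Operation = "Backup"
	OpRestore   Operation = "Restore"
	OpUpgrade   Operation = "VersionUpgrade"
	OpFailover  Operation = "Failover"
)

// ClusterHealth has one field per health criterion in the list above.
type ClusterHealth struct {
	PrimaryAvailable  bool
	ReplicationOK     bool
	LastBackupOK      bool
	WithinPerfTargets bool
}

// Problems returns the failed criteria; an empty result means healthy.
func (h ClusterHealth) Problems() []string {
	var problems []string
	if !h.PrimaryAvailable {
		problems = append(problems, "primary node unavailable")
	}
	if !h.ReplicationOK {
		problems = append(problems, "replication not functioning")
	}
	if !h.LastBackupOK {
		problems = append(problems, "last backup did not complete")
	}
	if !h.WithinPerfTargets {
		problems = append(problems, "performance outside thresholds")
	}
	return problems
}

func main() {
	h := ClusterHealth{PrimaryAvailable: true, ReplicationOK: true, WithinPerfTargets: true}
	fmt.Println("supported operation:", OpFailover)
	fmt.Println("healthy:", len(h.Problems()) == 0, "problems:", h.Problems())
}
```

Writing the criteria as explicit fields makes it obvious later which status conditions and metrics the operator has to produce.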
Define Custom Resources
With the domain model in place, I design the Custom Resource Definitions (CRDs) that will represent these concepts in Kubernetes:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseclusters.database.example.com
spec:
  group: database.example.com
  names:
    kind: DatabaseCluster
    listKind: DatabaseClusterList
    plural: databaseclusters
    singular: databasecluster
    shortNames:
      - dbc
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      # The controller code later in this article writes status through
      # the status subresource, so it has to be enabled here.
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["version", "replicas"]
              properties:
                version:
                  type: string
                  description: "Database engine version"
                replicas:
                  type: integer
                  minimum: 1
                  description: "Number of database instances"
                storage:
                  type: object
                  properties:
                    size:
                      type: string
                      pattern: "^[0-9]+(Gi|Ti)$"
                    storageClass:
                      type: string
                backup:
                  type: object
                  properties:
                    schedule:
                      type: string
                      # Five cron fields; allows *, /, ranges and lists (e.g. "0 2 * * *")
                      pattern: "^(@(yearly|monthly|weekly|daily|hourly)|([0-9*/,-]+ ){4}[0-9*/,-]+)$"
                    retention:
                      type: string
                      pattern: "^[0-9]+(d|w|m|y)$"
            status:
              type: object
              # Kept open here for brevity; a production CRD should spell out the status schema
              x-kubernetes-preserve-unknown-fields: true
Scope Definition
I carefully define the scope of the operator’s responsibilities:
- What the operator WILL do:
  - Clear, specific operational tasks
  - Well-defined failure handling
- What the operator WON’T do:
  - Explicit exclusions to prevent scope creep
  - Integration boundaries with other systems
- User personas and stories:
  - Who will use this operator?
  - What problems does it solve for them?
This scope definition becomes the foundation for both implementation and testing.
Design Phase: Architecture Decisions
With a clear problem definition, I move to architectural design:
Controller Architecture
I consider several controller architectures based on complexity:
- Single controller: For simpler operators with one resource type
- Multi-controller: For complex operators managing multiple resources
- Hierarchical controllers: For operators with parent-child resource relationships
For example, in a database operator, I might use a hierarchical approach:
- Primary controller for DatabaseCluster resources
- Secondary controllers for BackupSchedule and BackupSnapshot resources
- Each with well-defined responsibilities and boundaries
State Management
One of the most critical design decisions is how to manage state:
- Kubernetes-native state:
  - Store all state in resource status, annotations, or labels
  - Pros: No external dependencies, fits Kubernetes paradigm
  - Cons: Limited state capacity, eventual consistency challenges
- External state store:
  - Use external databases or key-value stores for complex state
  - Pros: Better for complex state management, potentially more consistent
  - Cons: Additional dependency, more complex deployment
In most cases, I prefer Kubernetes-native state management unless the state is truly complex or voluminous.
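To make the "limited state capacity" trade-off concrete, here is a minimal sketch of keeping a small piece of operator state in an annotation. Everything in it is an assumption for illustration: the annotation key, the `OperatorState` fields, and the 4 KiB budget are invented. The one hard fact is that Kubernetes caps the combined size of all annotations on an object at 256 KiB, so a self-imposed budget well below that is a sensible guard.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateKey is a hypothetical annotation key.
const stateKey = "database.example.com/operator-state"

// maxStateBytes is a self-imposed budget, far below the 256 KiB
// total that Kubernetes allows for all annotations on one object.
const maxStateBytes = 4 * 1024

// OperatorState is an invented example of small bookkeeping state.
type OperatorState struct {
	LastBackupID string `json:"lastBackupID,omitempty"`
	UpgradeStep  int    `json:"upgradeStep,omitempty"`
}

// SaveState serializes s into the annotations map, refusing oversized state.
func SaveState(annotations map[string]string, s OperatorState) error {
	if annotations == nil {
		return fmt.Errorf("annotations map is nil")
	}
	raw, err := json.Marshal(s)
	if err != nil {
		return fmt.Errorf("encode state: %w", err)
	}
	if len(raw) > maxStateBytes {
		return fmt.Errorf("state is %d bytes, budget is %d", len(raw), maxStateBytes)
	}
	annotations[stateKey] = string(raw)
	return nil
}

// LoadState returns the zero state when the annotation is absent.
func LoadState(annotations map[string]string) (OperatorState, error) {
	var s OperatorState
	raw, ok := annotations[stateKey]
	if !ok {
		return s, nil
	}
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return s, fmt.Errorf("decode state: %w", err)
	}
	return s, nil
}

func main() {
	annotations := map[string]string{}
	if err := SaveState(annotations, OperatorState{LastBackupID: "b-42", UpgradeStep: 2}); err != nil {
		panic(err)
	}
	s, err := LoadState(annotations)
	fmt.Println(s, err)
}
```

If the state regularly bumps into a budget like this, that is usually the signal that an external store is warranted.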
Reconciliation Strategy
I design the reconciliation loop with careful consideration of:
- Reconciliation frequency:
  - Event-driven for responsive operations
  - Periodic for eventual consistency and drift detection
  - Often a combination of both
- Idempotency:
  - Ensure operations can be safely repeated
  - Design for at-least-once delivery semantics
- Concurrency control:
  - Resource locking or optimistic concurrency
  - Sequential vs. parallel reconciliation
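Idempotency and optimistic concurrency are easier to reason about with a toy model. The sketch below uses an in-memory `Store` as a stand-in for the API server (an assumption made purely for illustration): updates carry the resource version the caller read, stale writes are rejected, and `EnsureReplicas` only writes when the observed state differs from the desired one.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict mimics the API server rejecting a write based on stale data.
var ErrConflict = errors.New("conflict: object was modified")

// Object is a minimal versioned resource.
type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store is an in-memory stand-in for the API server.
type Store struct {
	mu      sync.Mutex
	current Object
}

// Get returns a copy of the current object.
func (s *Store) Get() Object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// Update succeeds only if the caller read the latest version.
func (s *Store) Update(o Object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if o.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// EnsureReplicas is idempotent: it writes only when needed and
// re-reads the object after a conflict instead of overwriting blindly.
func EnsureReplicas(s *Store, desired, maxAttempts int) (writes int, err error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		obj := s.Get()
		if obj.Replicas == desired {
			return writes, nil // already converged, nothing to do
		}
		obj.Replicas = desired
		err = s.Update(obj)
		if err == nil {
			return writes + 1, nil
		}
		if !errors.Is(err, ErrConflict) {
			return writes, err
		}
	}
	return writes, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, ErrConflict)
}

func main() {
	s := &Store{}
	first, _ := EnsureReplicas(s, 3, 3)
	second, _ := EnsureReplicas(s, 3, 3)
	fmt.Println("writes on first call:", first, "writes on repeat:", second)
}
```

In a real controller the same shape appears as a get-mutate-update loop around the Kubernetes client; client-go ships `retry.RetryOnConflict` in `k8s.io/client-go/util/retry` for exactly this purpose.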
Technical Stack Selection
I choose the technical stack based on project requirements:
- Framework selection:
  - Operator SDK (Go): For most production operators
  - Kopf (Python): For internal or simpler operators
  - KUDO: For operators that can be described declaratively in YAML, without custom controller code
- Dependency management:
  - Minimize external dependencies
  - Clear versioning strategy for Kubernetes APIs
Here’s an example project structure for a Go-based operator:
my-operator/
├── api/
│   └── v1alpha1/
│       ├── databasecluster_types.go
│       └── zz_generated.deepcopy.go
├── controllers/
│   ├── databasecluster_controller.go
│   └── suite_test.go
├── pkg/
│   ├── engine/
│   │   ├── backup.go
│   │   ├── instance.go
│   │   └── monitoring.go
│   └── utils/
│       ├── health.go
│       └── status.go
├── config/
│   ├── crd/
│   ├── rbac/
│   └── manager/
├── Dockerfile
├── go.mod
└── main.go
Development Phase: Implementation Best Practices
With the architecture defined, I move to implementation:
Controller Patterns That Work
I follow these patterns for controller implementation:
1. Clean Reconciliation Loop
Keep the main reconciliation loop clean and focused:
func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("databasecluster", req.NamespacedName)

	// 1. Fetch the resource
	var databaseCluster dbv1alpha1.DatabaseCluster
	if err := r.Get(ctx, req.NamespacedName, &databaseCluster); err != nil {
		if client.IgnoreNotFound(err) == nil {
			// Resource deleted - no requeue
			return ctrl.Result{}, nil
		}
		log.Error(err, "Unable to fetch DatabaseCluster")
		return ctrl.Result{}, err
	}

	// 2. Initialize or update status if needed
	if r.initializeStatus(ctx, &databaseCluster) {
		return ctrl.Result{Requeue: true}, nil
	}

	// 3. Validation
	if err := r.validateDatabaseCluster(ctx, &databaseCluster); err != nil {
		log.Error(err, "Validation failed")
		r.recordFailedValidationEvent(&databaseCluster, err)
		// Do not silently drop a failed status write
		if statusErr := r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionFailed, err.Error()); statusErr != nil {
			log.Error(statusErr, "Unable to record validation failure in status")
		}
		return ctrl.Result{}, err
	}

	// 4. Main reconciliation logic - broken into clear steps
	if err := r.reconcileSecret(ctx, &databaseCluster); err != nil {
		return r.handleReconcileError(ctx, &databaseCluster, "Secret", err)
	}
	if err := r.reconcileConfigMap(ctx, &databaseCluster); err != nil {
		return r.handleReconcileError(ctx, &databaseCluster, "ConfigMap", err)
	}
	if err := r.reconcileStatefulSet(ctx, &databaseCluster); err != nil {
		return r.handleReconcileError(ctx, &databaseCluster, "StatefulSet", err)
	}
	if err := r.reconcileService(ctx, &databaseCluster); err != nil {
		return r.handleReconcileError(ctx, &databaseCluster, "Service", err)
	}

	// 5. Status update
	if err := r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionReady, ""); err != nil {
		return ctrl.Result{}, err
	}

	// 6. Schedule next reconciliation
	return ctrl.Result{RequeueAfter: r.reconcilePeriod}, nil
}
2. Finite State Machine
For complex operators, I implement a clear state machine:
func (r *DatabaseClusterReconciler) reconcileStatefulSet(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
	log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))

	// State machine logic
	switch dbc.Status.Phase {
	case dbv1alpha1.PhaseNone:
		log.Info("Initializing database cluster")
		// Update phase and requeue
		dbc.Status.Phase = dbv1alpha1.PhaseInitializing
		if err := r.Status().Update(ctx, dbc); err != nil {
			return err
		}
		return nil

	case dbv1alpha1.PhaseInitializing:
		// Create initial statefulset
		if err := r.createInitialStatefulSet(ctx, dbc); err != nil {
			return err
		}
		dbc.Status.Phase = dbv1alpha1.PhaseBootstrapping
		if err := r.Status().Update(ctx, dbc); err != nil {
			return err
		}
		return nil

	case dbv1alpha1.PhaseBootstrapping:
		// Implement bootstrap logic
		if ready, err := r.checkBootstrapStatus(ctx, dbc); err != nil {
			return err
		} else if ready {
			dbc.Status.Phase = dbv1alpha1.PhaseRunning
			if err := r.Status().Update(ctx, dbc); err != nil {
				return err
			}
		}
		return nil

	case dbv1alpha1.PhaseRunning:
		// Normal operation - check for updates/changes
		return r.reconcileRunningStatefulSet(ctx, dbc)

	case dbv1alpha1.PhaseUpgrading:
		// Handle upgrade process
		return r.handleUpgradeProcess(ctx, dbc)

	default:
		log.Error(fmt.Errorf("unknown phase"), "Unexpected phase", "phase", dbc.Status.Phase)
		return fmt.Errorf("unknown phase: %s", dbc.Status.Phase)
	}
}
3. Status Management
I implement comprehensive status management:
func (r *DatabaseClusterReconciler) updateStatusCondition(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster,
	conditionType dbv1alpha1.ConditionType, message string) error {
	// Find existing condition
	var condition *dbv1alpha1.DatabaseClusterCondition
	for i := range dbc.Status.Conditions {
		if dbc.Status.Conditions[i].Type == conditionType {
			condition = &dbc.Status.Conditions[i]
			break
		}
	}

	// If not found, create it
	if condition == nil {
		dbc.Status.Conditions = append(dbc.Status.Conditions, dbv1alpha1.DatabaseClusterCondition{
			Type: conditionType,
		})
		condition = &dbc.Status.Conditions[len(dbc.Status.Conditions)-1]
	}

	// An empty message means the condition holds; anything else marks it false
	status := metav1.ConditionTrue
	if message != "" {
		status = metav1.ConditionFalse
	}

	// LastTransitionTime should move only when the status actually flips,
	// otherwise every reconcile would make the condition look freshly changed
	if condition.Status != status {
		condition.Status = status
		condition.LastTransitionTime = metav1.Now()
	}
	condition.Message = message

	// Update the resource status
	return r.Status().Update(ctx, dbc)
}
Reconciliation Loop Best Practices
I follow these best practices for stable, maintainable reconciliation loops:
1. Idempotency
Every operation must be idempotent:
// Note: "errors" in these snippets is k8s.io/apimachinery/pkg/api/errors
func (r *DatabaseClusterReconciler) reconcileConfigMap(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
	// Define the desired ConfigMap
	desired := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      getConfigMapName(dbc),
			Namespace: dbc.Namespace,
			Labels:    getLabels(dbc),
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(dbc, schema.GroupVersionKind{
					Group:   dbv1alpha1.GroupVersion.Group,
					Version: dbv1alpha1.GroupVersion.Version,
					Kind:    "DatabaseCluster",
				}),
			},
		},
		Data: map[string]string{
			"db.conf": generateDBConfig(dbc),
		},
	}

	// Check if it already exists
	existing := &corev1.ConfigMap{}
	err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
	if err != nil && errors.IsNotFound(err) {
		// Create it if it doesn't exist
		if err = r.Create(ctx, desired); err != nil {
			return fmt.Errorf("failed to create ConfigMap: %w", err)
		}
		return nil
	} else if err != nil {
		return fmt.Errorf("failed to get ConfigMap: %w", err)
	}

	// Compare and update if needed
	if !reflect.DeepEqual(existing.Data, desired.Data) || !reflect.DeepEqual(existing.Labels, desired.Labels) {
		existing.Data = desired.Data
		existing.Labels = desired.Labels
		if err = r.Update(ctx, existing); err != nil {
			return fmt.Errorf("failed to update ConfigMap: %w", err)
		}
	}
	return nil
}
2. Error Handling
Implement consistent error handling:
func (r *DatabaseClusterReconciler) handleReconcileError(
	ctx context.Context,
	dbc *dbv1alpha1.DatabaseCluster,
	component string,
	err error,
) (ctrl.Result, error) {
	log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))
	log.Error(err, "Reconciliation failed", "component", component)

	// Record event
	r.Recorder.Event(dbc, corev1.EventTypeWarning,
		"Failed"+component,
		fmt.Sprintf("Failed to reconcile %s: %v", component, err),
	)

	// Update status
	updateErr := r.updateStatusCondition(ctx, dbc, dbv1alpha1.ConditionReady,
		fmt.Sprintf("Failed to reconcile %s: %v", component, err))
	if updateErr != nil {
		log.Error(updateErr, "Failed to update status")
		// If we can't update status, return both errors; %w keeps the
		// original error inspectable by callers
		return ctrl.Result{}, fmt.Errorf("original error: %w, status update error: %v", err, updateErr)
	}

	// For some errors, we don't want to requeue immediately
	if errors.IsConflict(err) || errors.IsAlreadyExists(err) {
		return ctrl.Result{RequeueAfter: time.Second * 5}, nil
	}
	return ctrl.Result{}, err
}
3. Dependency Management
I ensure clean dependency management by:
- Using interfaces for external services
- Implementing dependency injection
- Creating mocks for testing
// Define an interface for database operations
type DatabaseEngine interface {
	CreateDatabase(ctx context.Context, name string) error
	CreateUser(ctx context.Context, username, password string) error
	GrantPermissions(ctx context.Context, username, database string, permissions []string) error
	// Other methods...
}

// Controller using the interface
type DatabaseClusterReconciler struct {
	client.Client
	Log      logr.Logger
	Scheme   *runtime.Scheme
	Recorder record.EventRecorder
	DBEngine DatabaseEngine

	// Interval used for the periodic requeue at the end of Reconcile
	reconcilePeriod time.Duration
}

// Implementation for a specific database
type PostgresEngine struct {
	// Implementation details
}

func (p *PostgresEngine) CreateDatabase(ctx context.Context, name string) error {
	// Postgres-specific implementation goes here
	return fmt.Errorf("CreateDatabase: not implemented")
}

// The remaining DatabaseEngine methods are omitted for brevity; PostgresEngine
// must implement all of them before it can be assigned to DBEngine.

// In main.go, wire up the correct implementation
reconciler := &controllers.DatabaseClusterReconciler{
	Client:   mgr.GetClient(),
	Log:      ctrl.Log.WithName("controllers").WithName("DatabaseCluster"),
	Scheme:   mgr.GetScheme(),
	Recorder: mgr.GetEventRecorderFor("databasecluster-controller"),
	DBEngine: &controllers.PostgresEngine{},
}
Testing Strategies
I approach operator testing comprehensively:
Unit Testing
I unit test all core logic using mocks:
func TestReconcileConfigMap(t *testing.T) {
	// Setup
	dbc := &dbv1alpha1.DatabaseCluster{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "test-db",
			Namespace: "default",
		},
		Spec: dbv1alpha1.DatabaseClusterSpec{
			Version:  "13.3",
			Replicas: 3,
		},
	}

	// Create a fake client
	objs := []client.Object{dbc}
	scheme := runtime.NewScheme()
	assert.NoError(t, dbv1alpha1.AddToScheme(scheme))
	assert.NoError(t, corev1.AddToScheme(scheme))
	fakeClient := fake.NewClientBuilder().WithScheme(scheme).WithObjects(objs...).Build()
	reconciler := &DatabaseClusterReconciler{
		Client:   fakeClient,
		Scheme:   scheme,
		Recorder: record.NewFakeRecorder(10),
	}

	// Execute
	err := reconciler.reconcileConfigMap(context.Background(), dbc)

	// Verify
	assert.NoError(t, err)

	// Check if ConfigMap was created
	configMap := &corev1.ConfigMap{}
	err = fakeClient.Get(context.Background(),
		types.NamespacedName{Name: "test-db-config", Namespace: "default"},
		configMap)
	assert.NoError(t, err)
	assert.Equal(t, "test-db-config", configMap.Name)
	assert.Contains(t, configMap.Data, "db.conf")
}
Integration Testing
I use the Kubernetes controller-runtime envtest package for integration testing:
var (
	cfg       *rest.Config
	k8sClient client.Client
	testEnv   *envtest.Environment
	ctx       context.Context
	cancel    context.CancelFunc
)

// Polling bounds for Eventually; tune these for your CI environment
const (
	timeout  = 10 * time.Second
	interval = 250 * time.Millisecond
)

func TestAPIs(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
	ctx, cancel = context.WithCancel(context.TODO())

	By("bootstrapping test environment")
	testEnv = &envtest.Environment{
		CRDDirectoryPaths:     []string{filepath.Join("..", "config", "crd", "bases")},
		ErrorIfCRDPathMissing: true,
	}

	var err error
	cfg, err = testEnv.Start()
	Expect(err).NotTo(HaveOccurred())
	Expect(cfg).NotTo(BeNil())

	err = dbv1alpha1.AddToScheme(scheme.Scheme)
	Expect(err).NotTo(HaveOccurred())

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())
	Expect(k8sClient).NotTo(BeNil())

	// Start the controller under test; without a running manager nothing
	// reconciles and the Eventually checks below would time out.
	// SetupWithManager is assumed to exist as in the standard scaffold.
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())
	err = (&DatabaseClusterReconciler{
		Client:   mgr.GetClient(),
		Log:      ctrl.Log.WithName("controllers").WithName("DatabaseCluster"),
		Scheme:   mgr.GetScheme(),
		Recorder: mgr.GetEventRecorderFor("databasecluster-controller"),
	}).SetupWithManager(mgr)
	Expect(err).NotTo(HaveOccurred())
	go func() {
		defer GinkgoRecover()
		Expect(mgr.Start(ctx)).To(Succeed())
	}()
})

var _ = AfterSuite(func() {
	cancel()
	By("tearing down the test environment")
	err := testEnv.Stop()
	Expect(err).NotTo(HaveOccurred())
})

var _ = Describe("DatabaseCluster controller", func() {
	Context("When creating a DatabaseCluster", func() {
		It("Should create associated resources", func() {
			dbc := &dbv1alpha1.DatabaseCluster{
				ObjectMeta: metav1.ObjectMeta{
					Name:      "test-db",
					Namespace: "default",
				},
				Spec: dbv1alpha1.DatabaseClusterSpec{
					Version:  "13.3",
					Replicas: 3,
				},
			}
			Expect(k8sClient.Create(ctx, dbc)).Should(Succeed())

			// Wait for resources to be created
			configMap := &corev1.ConfigMap{}
			Eventually(func() bool {
				err := k8sClient.Get(ctx, types.NamespacedName{
					Name: "test-db-config", Namespace: "default",
				}, configMap)
				return err == nil
			}, timeout, interval).Should(BeTrue())

			// Verify statefulset was created
			sts := &appsv1.StatefulSet{}
			Eventually(func() bool {
				err := k8sClient.Get(ctx, types.NamespacedName{
					Name: "test-db", Namespace: "default",
				}, sts)
				return err == nil
			}, timeout, interval).Should(BeTrue())
			Expect(sts.Spec.Replicas).To(Equal(pointer.Int32Ptr(3)))
		})
	})
})
End-to-End Testing
I create actual Kubernetes clusters for end-to-end testing:
func TestOperatorEndToEnd(t *testing.T) {
	// Skip if not running in CI environment with K8s access
	if os.Getenv("RUN_E2E_TESTS") != "true" {
		t.Skip("Skipping E2E tests")
	}

	// Use the current context in kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatalf("Error building kubeconfig: %v", err)
	}

	// Create a clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatalf("Error creating clientset: %v", err)
	}

	// Create dynamic client for CRDs
	dynamicClient, err := dynamic.NewForConfig(config)
	if err != nil {
		t.Fatalf("Error creating dynamic client: %v", err)
	}
	_ = dynamicClient // used by the custom-resource subtests; Go rejects unused variables

	// Create test namespace
	namespace := &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{
			Name: "operator-e2e-test",
		},
	}
	_, err = clientset.CoreV1().Namespaces().Create(context.TODO(), namespace, metav1.CreateOptions{})
	if err != nil && !errors.IsAlreadyExists(err) {
		t.Fatalf("Error creating namespace: %v", err)
	}

	// Clean up, even if a subtest fails
	t.Cleanup(func() {
		err := clientset.CoreV1().Namespaces().Delete(context.TODO(), namespace.Name, metav1.DeleteOptions{})
		if err != nil {
			t.Errorf("Error deleting namespace: %v", err)
		}
	})

	// Run the actual tests...
	t.Run("CreateDatabaseCluster", func(t *testing.T) {
		// Test creating a DatabaseCluster and verify it works
	})
}
Chaos Testing
For production-critical operators, I implement chaos testing:
- Use tools like Chaos Mesh or Litmus to:
  - Kill operator pods
  - Disconnect from the API server
  - Simulate network partitions
- Verify recovery behavior:
  - Does the operator resume reconciliation?
  - Are resources eventually consistent?
  - How does it handle conflicting changes?
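The "eventually consistent" question usually turns into a polling assertion in the chaos test harness: inject the fault, then wait a bounded time for the system to converge. The helper below is a minimal stdlib sketch of that idea. `WaitForConvergence`, `flakyCheck`, and `runDemo` are hypothetical names, and `flakyCheck` merely simulates a system that recovers after a few polls.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// WaitForConvergence polls check until it reports true, returns an
// error, or ctx expires. The first check runs immediately.
func WaitForConvergence(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := check()
		if err != nil {
			return fmt.Errorf("check failed: %w", err)
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("did not converge: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// flakyCheck simulates a system that becomes consistent on the nth poll.
func flakyCheck(n int) func() (bool, error) {
	calls := 0
	return func() (bool, error) {
		calls++
		return calls >= n, nil
	}
}

// runDemo wires the helper to the simulated system.
func runDemo(timeout, interval time.Duration, convergeAfter int) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return WaitForConvergence(ctx, interval, flakyCheck(convergeAfter))
}

func main() {
	err := runDemo(time.Second, 10*time.Millisecond, 3)
	fmt.Println("converged:", err == nil)
}
```

In a real chaos test the check function would query the cluster, for example comparing the observed StatefulSet against the DatabaseCluster spec after the operator pod has been killed.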
Deployment and Monitoring Considerations
Deploying and monitoring operators requires careful planning:
Operator Deployment Strategies
I use a progressive deployment approach:
- Development:
  - Controller runs locally against dev cluster
  - Rapid iteration and testing
- Staging:
  - Deployed via CI/CD pipeline
  - Mimics production setup
  - Integration and end-to-end testing
- Production:
  - Deployment with appropriate replicas
  - RBAC configuration with least privilege
  - Resource limits and requests
Example Helm chart values for production:
# Production values for database-operator
replicaCount: 2
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 256Mi
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - database-operator
        topologyKey: "kubernetes.io/hostname"
rbac:
  create: true
  clusterRole: true
Monitoring and Alerting
I implement comprehensive monitoring for operators:
- Prometheus metrics:
  - Reconciliation counts and durations
  - Error counts by type
  - Resource counts by state
// Define metrics
// Note: promauto registers with the Prometheus default registry. To expose
// these on the controller-runtime metrics endpoint, register them with
// metrics.Registry from sigs.k8s.io/controller-runtime/pkg/metrics instead.
var (
	reconcileTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "controller_reconcile_total",
			Help: "The total number of reconciliations",
		},
		[]string{"controller", "result"},
	)
	reconcileDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "controller_reconcile_duration_seconds",
			Help:    "The duration of reconciliations",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"controller"},
	)
	resourcesTotal = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "controller_resources_total",
			Help: "The total number of resources by state",
		},
		[]string{"controller", "kind", "state"},
	)
)

// Instrument the reconciliation loop
func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	controllerName := "databasecluster"
	timer := prometheus.NewTimer(reconcileDuration.WithLabelValues(controllerName))
	defer timer.ObserveDuration()

	// The reconciliation logic lives in a separate method; the name
	// r.reconcile is a placeholder for illustration
	result, err := r.reconcile(ctx, req)
	if err != nil {
		reconcileTotal.WithLabelValues(controllerName, "error").Inc()
		return ctrl.Result{}, err
	}
	reconcileTotal.WithLabelValues(controllerName, "success").Inc()
	return result, nil
}
- Log levels and formats:
  - Structured logging
  - Dynamic log levels
  - Correlation IDs for tracking reconciliation
// Set up structured logging
// (zap here is sigs.k8s.io/controller-runtime/pkg/log/zap)
func setupLogger() logr.Logger {
	var opts zap.Options
	opts.Development = false
	opts.EncoderConfigOptions = append(
		opts.EncoderConfigOptions,
		func(c *zapcore.EncoderConfig) {
			c.TimeKey = "timestamp"
			c.EncodeTime = zapcore.ISO8601TimeEncoder
		},
	)
	// zap.New already returns a logr.Logger, so no zapr wrapper is needed
	return zap.New(zap.UseFlagOptions(&opts))
}
- Alerting rules:
  - Error rate thresholds
  - Reconciliation stalls
  - Resource drift detection
groups:
  - name: DatabaseOperatorAlerts
    rules:
      - alert: DatabaseOperatorHighErrorRate
        expr: rate(controller_reconcile_total{controller="databasecluster",result="error"}[5m]) > 0.1
        for: 10m
        annotations:
          summary: "Database operator high error rate"
          description: "Database operator has a high error rate for the last 10 minutes"
      - alert: DatabaseOperatorReconciliationStalled
        # Assumes the operator also exports a per-resource gauge named
        # controller_resource_last_reconcile_time (Unix seconds).
        # time() is a scalar, so no on(...) vector matching is used.
        expr: count(time() - max_over_time(controller_resource_last_reconcile_time{controller="databasecluster"}[1h]) > 3600) > 0
        for: 5m
        annotations:
          summary: "Database reconciliation stalled"
          description: "Some database resources haven't been reconciled in over an hour"
Case Study: The Cross-Cluster Configuration Operator I Built
To illustrate these principles, I’ll share a real-world operator I developed for a financial services client:
Problem Statement
The organization needed to enforce consistent configuration policies across 40+ Kubernetes clusters spanning on-premises data centers and multiple cloud providers. Manual configuration was error-prone and time-consuming.
Solution: ConfigSync Operator
I designed and built a ConfigSync operator that:
- Managed a central “source of truth” for configurations:
  - Security policies
  - Resource quotas
  - Network policies
  - Admission control
- Supported multi-level inheritance:
  - Global baseline
  - Environment-specific (prod, dev, test)
  - Region-specific
  - Cluster-specific
- Provided policy validation and enforcement:
  - Pre-deployment validation
  - Drift detection
  - Automatic remediation
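Multi-level inheritance can be modeled as an ordered merge in which more specific levels override more general ones. The sketch below shows that idea only; it is not the ConfigSync operator's actual code, and it assumes a flat key/value configuration and a simple last-writer-wins rule. `Layer` and `Resolve` are hypothetical names.

```go
package main

import "fmt"

// Layer is one level of configuration, e.g. global or cluster-specific.
type Layer struct {
	Name   string
	Values map[string]string
}

// Resolve merges layers in order, so later (more specific) layers
// override earlier ones. origin records which layer supplied each key,
// which is useful for audit reports.
func Resolve(layers ...Layer) (merged, origin map[string]string) {
	merged = map[string]string{}
	origin = map[string]string{}
	for _, l := range layers {
		for k, v := range l.Values {
			merged[k] = v
			origin[k] = l.Name
		}
	}
	return merged, origin
}

func main() {
	merged, origin := Resolve(
		Layer{"global", map[string]string{"quota.cpu": "4", "netpol": "deny-all"}},
		Layer{"prod", map[string]string{"quota.cpu": "16"}},
		Layer{"us-east", map[string]string{}},
		Layer{"cluster-a", map[string]string{"quota.memory": "64Gi"}},
	)
	for k, v := range merged {
		fmt.Printf("%s=%s (from %s)\n", k, v, origin[k])
	}
}
```

Tracking the origin of each value costs almost nothing and makes "why does this cluster have this setting?" answerable without reading four Git directories.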
Implementation Details
The operator consisted of:
Central ConfigStore CRD:
apiVersion: config.example.com/v1alpha1
kind: ConfigStore
metadata:
  name: global-configs
spec:
  source:
    git:
      repository: https://github.com/company/k8s-configs
      branch: main
      path: /global
  schedule: "*/30 * * * *"
  validation:
    mode: Strict # Strict, Warn, or None
ConfigSync CRD:
apiVersion: config.example.com/v1alpha1
kind: ConfigSync
metadata:
  name: cluster-config-sync
spec:
  sources:
    - name: global
      configStore: global-configs
    - name: environment
      configStore: prod-configs
    - name: regional
      configStore: us-east-configs
  targets:
    - kind: NetworkPolicy
      apiGroup: networking.k8s.io
      namespaces:
        - "*"
    - kind: ResourceQuota
      apiGroup: ""
      namespaces:
        - default
        - kube-system
  remediation:
    mode: Apply # Apply, Report, or None
Controller Implementation:
- Periodically fetched configurations from Git
- Dynamically built templates with context-specific values
- Validated and applied configurations to target clusters
- Reported compliance status and drift
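The drift detection and remediation step can be sketched as a comparison between desired and actual state, gated by the remediation mode. The mode names (Apply, Report, None) come from the ConfigSync example above; everything else here, including the flat key/value model and the `DetectDrift` and `Remediate` functions, is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Mode mirrors spec.remediation.mode from the ConfigSync example.
type Mode string

const (
	ModeApply  Mode = "Apply"
	ModeReport Mode = "Report"
	ModeNone   Mode = "None"
)

// Drift describes one managed key whose actual value differs.
type Drift struct {
	Key, Want, Got string
}

// DetectDrift compares only keys present in desired; unmanaged keys in
// actual are ignored. Results are sorted by key for stable reports.
func DetectDrift(desired, actual map[string]string) []Drift {
	var drifts []Drift
	for k, want := range desired {
		got, ok := actual[k]
		if !ok || got != want {
			drifts = append(drifts, Drift{Key: k, Want: want, Got: got})
		}
	}
	sort.Slice(drifts, func(i, j int) bool { return drifts[i].Key < drifts[j].Key })
	return drifts
}

// Remediate reports drift and, in Apply mode, fixes it in place.
// In None mode it does nothing and returns nil.
func Remediate(mode Mode, desired, actual map[string]string) []Drift {
	if mode == ModeNone {
		return nil
	}
	drifts := DetectDrift(desired, actual)
	if mode == ModeApply {
		for _, d := range drifts {
			actual[d.Key] = d.Want
		}
	}
	return drifts
}

func main() {
	desired := map[string]string{"quota/cpu": "10", "netpol/default-deny": "enabled"}
	actual := map[string]string{"quota/cpu": "20", "unmanaged": "x"}
	fmt.Println("reported:", Remediate(ModeReport, desired, actual))
	Remediate(ModeApply, desired, actual)
	fmt.Println("remaining drift:", len(DetectDrift(desired, actual)))
}
```

Keeping detection separate from remediation lets Report mode reuse exactly the same comparison that Apply mode acts on, so the compliance report and the enforcement path cannot disagree.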
GitOps Integration:
- Config changes triggered CI validation
- Deployment approval workflow
- Automatic rollback on validation failures
Results
The operator transformed the organization’s configuration management:
- Reduced configuration errors by 94%
- Decreased time-to-deploy for policy changes from days to minutes
- Improved security posture with consistent enforcement
- Simplified audit compliance with comprehensive reporting
Conclusion
Building Kubernetes operators is both an art and a science. The most successful operators strike a balance between solving real operational challenges and maintaining simplicity and reliability.
As you approach your own operator development, remember:
- Start with a clear problem definition:
  - Is an operator the right solution?
  - What specific operational challenges will it solve?
- Design for the Kubernetes ecosystem:
  - Follow Kubernetes patterns and principles
  - Leverage the declarative model
  - Build for eventual consistency
- Implement with a focus on reliability:
  - Comprehensive testing
  - Proper error handling
  - Observability from day one
- Evolve gradually:
  - Start with a minimal viable operator
  - Add capabilities based on actual needs
  - Maintain backward compatibility
By following the approaches outlined in this playbook, you’ll be well-equipped to build operators that not only solve real problems but do so reliably and maintainably.
For further discussion or questions about operator development, feel free to reach out. I’m always interested in hearing about interesting operator use cases and implementation challenges.