After working with Kubernetes across multiple distributions and enterprises, I’ve found that the true power of Kubernetes often lies not in its out-of-the-box capabilities, but in its extensibility. Custom operators—Kubernetes-native applications that extend the platform’s functionality—have been among the most powerful tools in my arsenal for solving complex operational challenges.
However, building operators is not a trivial undertaking. They’re effectively distributed systems components that need to be resilient, performant, and secure. Over the years, I’ve developed a systematic approach to operator development that balances technical elegance with practical business value.
In this article, I’ll share my operator development playbook, refined through the creation of over a dozen production operators across various industries. Whether you’re considering your first operator or looking to improve your existing development process, these patterns and practices should help you build more effective, maintainable Kubernetes extensions.
Before diving into implementation, it’s crucial to determine whether an operator is the right solution. I evaluate potential operator candidates with a simple scoring matrix:
| Criterion | Weight | Score (1-5) |
|---|---|---|
| Repetitive manual operations | 3 | ? |
| Complex lifecycle management | 3 | ? |
| Well-defined operational model | 2 | ? |
| Clear domain boundaries | 2 | ? |
| Need for Kubernetes-native integration | 2 | ? |
| Ongoing development resources | 1 | ? |
Projects scoring above 40 (out of 65) typically make good operator candidates.
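For example, a hypothetical candidate scoring 5, 4, 4, 3, 4, and 3 against the criteria above (in the order listed) totals 3×5 + 3×4 + 2×4 + 2×3 + 2×4 + 1×3 = 52, comfortably above the threshold; a project that only scores well on the lightly weighted criteria will struggle to reach 40.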
Once I’ve decided an operator is the right approach, I begin with a thorough planning phase:
First, I map out the domain model by asking what resources the domain contains, what operations must be performed on them, and how their health is determined.
For example, when building a database operator, I might define:
Resources:
- DatabaseCluster (top-level resource)
- DatabaseInstance (individual database nodes)
- BackupSchedule
- BackupSnapshot

Operations:
- Provisioning
- Scaling
- Backup/Restore
- Version Upgrades
- Failover

Health Criteria:
- Primary node available
- Replication functioning
- Backups completing successfully
- Performance within thresholds
With the domain model in place, I design the Custom Resource Definitions (CRDs) that will represent these concepts in Kubernetes:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseclusters.database.example.com
spec:
  group: database.example.com
  names:
    kind: DatabaseCluster
    listKind: DatabaseClusterList
    plural: databaseclusters
    singular: databasecluster
    shortNames:
      - dbc
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["version", "replicas"]
              properties:
                version:
                  type: string
                  description: "Database engine version"
                replicas:
                  type: integer
                  minimum: 1
                  description: "Number of database instances"
                storage:
                  type: object
                  properties:
                    size:
                      type: string
                      pattern: "^[0-9]+(Gi|Ti)$"
                    storageClass:
                      type: string
                backup:
                  type: object
                  properties:
                    schedule:
                      type: string
                      pattern: "^(@(yearly|monthly|weekly|daily|hourly)|([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+))$"
                    retention:
                      type: string
                      pattern: "^[0-9]+(d|w|m|y)$"
```
I also carefully define the scope of the operator’s responsibilities: what it will manage directly, and what it deliberately leaves to other tools.
This scope definition becomes the foundation for both implementation and testing.
With a clear problem definition, I move to architectural design:
I consider several controller architectures based on complexity, ranging from a single controller to a hierarchy of cooperating controllers.
For example, in a database operator, I might use a hierarchical approach:
- Primary controller for DatabaseCluster resources
- Secondary controllers for BackupSchedule and BackupSnapshot resources
- Each with well-defined responsibilities and boundaries
One of the most critical design decisions is how to manage state: on the custom resource’s status, in other Kubernetes objects, or in an external store.
In most cases, I prefer Kubernetes-native state management unless the state is truly complex or voluminous.
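Since the controller code later in this article reads and writes `dbc.Status.Phase` and `dbc.Status.Conditions`, here is a minimal sketch of what Kubernetes-native state can look like on the custom resource’s status. The names follow the snippets below, but the exact shapes are assumptions.

```go
// A sketch of the status types referenced by the controller code below.
// DatabaseClusterStatus is persisted on the custom resource's status
// subresource, so the cluster itself is the source of truth for operator state.
type DatabaseClusterPhase string

type ConditionType string

const (
    ConditionReady  ConditionType = "Ready"
    ConditionFailed ConditionType = "Failed"
)

// DatabaseClusterCondition mirrors the shape of metav1.Condition.
type DatabaseClusterCondition struct {
    Type               ConditionType          `json:"type"`
    Status             metav1.ConditionStatus `json:"status"`
    LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
    Message            string                 `json:"message,omitempty"`
}

type DatabaseClusterStatus struct {
    // Phase is the coarse-grained state machine position (e.g. Initializing, Running).
    Phase DatabaseClusterPhase `json:"phase,omitempty"`
    // Conditions capture finer-grained readiness and failure information.
    Conditions []DatabaseClusterCondition `json:"conditions,omitempty"`
}
```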
I design the reconciliation loop with careful consideration of idempotency, error handling, and requeue behavior; these are the themes of the implementation patterns below.
I choose the technical stack based on project requirements; for most operators I default to Go with controller-runtime, which is what the examples in this article use.
Here’s an example project structure for a Go-based operator:
```
my-operator/
├── api/
│   └── v1alpha1/
│       ├── databasecluster_types.go
│       └── zz_generated.deepcopy.go
├── controllers/
│   ├── databasecluster_controller.go
│   └── suite_test.go
├── pkg/
│   ├── engine/
│   │   ├── backup.go
│   │   ├── instance.go
│   │   └── monitoring.go
│   └── utils/
│       ├── health.go
│       └── status.go
├── config/
│   ├── crd/
│   ├── rbac/
│   └── manager/
├── Dockerfile
├── go.mod
└── main.go
```
With the architecture defined, I move to implementation:
I follow these patterns for controller implementation:
Keep the main reconciliation loop clean and focused:
```go
func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("databasecluster", req.NamespacedName)

    // 1. Fetch the resource
    var databaseCluster dbv1alpha1.DatabaseCluster
    if err := r.Get(ctx, req.NamespacedName, &databaseCluster); err != nil {
        if client.IgnoreNotFound(err) == nil {
            // Resource deleted - no requeue
            return ctrl.Result{}, nil
        }
        log.Error(err, "Unable to fetch DatabaseCluster")
        return ctrl.Result{}, err
    }

    // 2. Initialize or update status if needed
    if r.initializeStatus(ctx, &databaseCluster) {
        return ctrl.Result{Requeue: true}, nil
    }

    // 3. Validation
    if err := r.validateDatabaseCluster(ctx, &databaseCluster); err != nil {
        log.Error(err, "Validation failed")
        r.recordFailedValidationEvent(&databaseCluster, err)
        r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionFailed, err.Error())
        return ctrl.Result{}, err
    }

    // 4. Main reconciliation logic - broken into clear steps
    if err := r.reconcileSecret(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "Secret", err)
    }
    if err := r.reconcileConfigMap(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "ConfigMap", err)
    }
    if err := r.reconcileStatefulSet(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "StatefulSet", err)
    }
    if err := r.reconcileService(ctx, &databaseCluster); err != nil {
        return r.handleReconcileError(ctx, &databaseCluster, "Service", err)
    }

    // 5. Status update
    r.updateStatusCondition(ctx, &databaseCluster, dbv1alpha1.ConditionReady, "")

    // 6. Schedule next reconciliation
    return ctrl.Result{RequeueAfter: r.reconcilePeriod}, nil
}
```
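The reconciler also needs to be registered with the controller-runtime manager so that it watches DatabaseCluster objects and the child resources it creates. A minimal sketch of that wiring, following the standard Kubebuilder pattern and assuming owner references are set on the children as shown later:

```go
// SetupWithManager registers the controller. Owns() means events on child
// resources created by this controller also trigger reconciliation of the
// parent DatabaseCluster.
func (r *DatabaseClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&dbv1alpha1.DatabaseCluster{}).
        Owns(&appsv1.StatefulSet{}).
        Owns(&corev1.Service{}).
        Owns(&corev1.ConfigMap{}).
        Owns(&corev1.Secret{}).
        Complete(r)
}
```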
For complex operators, I implement a clear state machine:
```go
func (r *DatabaseClusterReconciler) reconcileStatefulSet(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
    log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))

    // State machine logic
    switch dbc.Status.Phase {
    case dbv1alpha1.PhaseNone:
        log.Info("Initializing database cluster")
        // Update phase and requeue
        dbc.Status.Phase = dbv1alpha1.PhaseInitializing
        if err := r.Status().Update(ctx, dbc); err != nil {
            return err
        }
        return nil
    case dbv1alpha1.PhaseInitializing:
        // Create initial statefulset
        if err := r.createInitialStatefulSet(ctx, dbc); err != nil {
            return err
        }
        dbc.Status.Phase = dbv1alpha1.PhaseBootstrapping
        if err := r.Status().Update(ctx, dbc); err != nil {
            return err
        }
        return nil
    case dbv1alpha1.PhaseBootstrapping:
        // Implement bootstrap logic
        if ready, err := r.checkBootstrapStatus(ctx, dbc); err != nil {
            return err
        } else if ready {
            dbc.Status.Phase = dbv1alpha1.PhaseRunning
            if err := r.Status().Update(ctx, dbc); err != nil {
                return err
            }
        }
        return nil
    case dbv1alpha1.PhaseRunning:
        // Normal operation - check for updates/changes
        return r.reconcileRunningStatefulSet(ctx, dbc)
    case dbv1alpha1.PhaseUpgrading:
        // Handle upgrade process
        return r.handleUpgradeProcess(ctx, dbc)
    default:
        log.Error(fmt.Errorf("unknown phase"), "Unexpected phase", "phase", dbc.Status.Phase)
        return fmt.Errorf("unknown phase: %s", dbc.Status.Phase)
    }
}
```
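The phases driving this switch are plain string constants in the API package. A sketch, with values assumed from the cases above (the empty string doubles as the zero value of a freshly created resource):

```go
const (
    PhaseNone          DatabaseClusterPhase = ""
    PhaseInitializing  DatabaseClusterPhase = "Initializing"
    PhaseBootstrapping DatabaseClusterPhase = "Bootstrapping"
    PhaseRunning       DatabaseClusterPhase = "Running"
    PhaseUpgrading     DatabaseClusterPhase = "Upgrading"
)
```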
I implement comprehensive status management:
```go
func (r *DatabaseClusterReconciler) updateStatusCondition(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster,
    conditionType dbv1alpha1.ConditionType, message string) error {

    // Find existing condition
    var condition *dbv1alpha1.DatabaseClusterCondition
    for i := range dbc.Status.Conditions {
        if dbc.Status.Conditions[i].Type == conditionType {
            condition = &dbc.Status.Conditions[i]
            break
        }
    }

    // If not found, create it
    if condition == nil {
        dbc.Status.Conditions = append(dbc.Status.Conditions, dbv1alpha1.DatabaseClusterCondition{
            Type: conditionType,
        })
        condition = &dbc.Status.Conditions[len(dbc.Status.Conditions)-1]
    }

    // Update condition
    now := metav1.Now()
    if message == "" {
        condition.Status = metav1.ConditionTrue
        condition.LastTransitionTime = now
        condition.Message = ""
    } else {
        condition.Status = metav1.ConditionFalse
        condition.LastTransitionTime = now
        condition.Message = message
    }

    // Update the resource status
    return r.Status().Update(ctx, dbc)
}
```
I follow these best practices for stable, maintainable reconciliation loops:
Every operation must be idempotent:
```go
func (r *DatabaseClusterReconciler) reconcileConfigMap(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
    // Define the desired ConfigMap
    desired := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      getConfigMapName(dbc),
            Namespace: dbc.Namespace,
            Labels:    getLabels(dbc),
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(dbc, schema.GroupVersionKind{
                    Group:   dbv1alpha1.GroupVersion.Group,
                    Version: dbv1alpha1.GroupVersion.Version,
                    Kind:    "DatabaseCluster",
                }),
            },
        },
        Data: map[string]string{
            "db.conf": generateDBConfig(dbc),
        },
    }

    // Check if it already exists
    existing := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)

    if err != nil && errors.IsNotFound(err) {
        // Create it if it doesn't exist
        if err = r.Create(ctx, desired); err != nil {
            return fmt.Errorf("failed to create ConfigMap: %w", err)
        }
        return nil
    } else if err != nil {
        return fmt.Errorf("failed to get ConfigMap: %w", err)
    }

    // Compare and update if needed
    if !reflect.DeepEqual(existing.Data, desired.Data) || !reflect.DeepEqual(existing.Labels, desired.Labels) {
        existing.Data = desired.Data
        existing.Labels = desired.Labels
        if err = r.Update(ctx, existing); err != nil {
            return fmt.Errorf("failed to update ConfigMap: %w", err)
        }
    }

    return nil
}
```
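controller-runtime also ships a helper that collapses this get/create/compare/update dance. A roughly equivalent sketch using controllerutil.CreateOrUpdate, reusing the getConfigMapName, getLabels, and generateDBConfig helpers from the example above (the function name here is hypothetical):

```go
import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

func (r *DatabaseClusterReconciler) reconcileConfigMapWithHelper(ctx context.Context, dbc *dbv1alpha1.DatabaseCluster) error {
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{Name: getConfigMapName(dbc), Namespace: dbc.Namespace},
    }
    // The mutate callback declares the desired state; CreateOrUpdate decides
    // whether a create, an update, or nothing at all is required.
    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
        cm.Labels = getLabels(dbc)
        cm.Data = map[string]string{"db.conf": generateDBConfig(dbc)}
        return controllerutil.SetControllerReference(dbc, cm, r.Scheme)
    })
    return err
}
```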
Implement consistent error handling:
```go
func (r *DatabaseClusterReconciler) handleReconcileError(
    ctx context.Context,
    dbc *dbv1alpha1.DatabaseCluster,
    component string,
    err error,
) (ctrl.Result, error) {
    log := r.Log.WithValues("databasecluster", client.ObjectKeyFromObject(dbc))
    log.Error(err, "Reconciliation failed", "component", component)

    // Record event
    r.Recorder.Event(dbc, corev1.EventTypeWarning,
        "Failed"+component,
        fmt.Sprintf("Failed to reconcile %s: %v", component, err),
    )

    // Update status
    updateErr := r.updateStatusCondition(ctx, dbc, dbv1alpha1.ConditionReady,
        fmt.Sprintf("Failed to reconcile %s: %v", component, err))

    if updateErr != nil {
        log.Error(updateErr, "Failed to update status")
        // If we can't update status, return both errors
        return ctrl.Result{}, fmt.Errorf("original error: %v, status update error: %v", err, updateErr)
    }

    // For some errors, we don't want to requeue immediately
    if errors.IsConflict(err) || errors.IsAlreadyExists(err) {
        return ctrl.Result{RequeueAfter: time.Second * 5}, nil
    }

    return ctrl.Result{}, err
}
```
I ensure clean dependency management by defining interfaces for external systems and injecting concrete implementations at startup:
```go
// Define an interface for database operations
type DatabaseEngine interface {
    CreateDatabase(ctx context.Context, name string) error
    CreateUser(ctx context.Context, username, password string) error
    GrantPermissions(ctx context.Context, username, database string, permissions []string) error
    // Other methods...
}

// Controller using the interface
type DatabaseClusterReconciler struct {
    client.Client
    Log      logr.Logger
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder
    DBEngine DatabaseEngine
}

// Implementation for a specific database
type PostgresEngine struct {
    // Implementation details
}

func (p *PostgresEngine) CreateDatabase(ctx context.Context, name string) error {
    // Postgres-specific implementation
}

// In main.go, wire up the correct implementation
reconciler := &controllers.DatabaseClusterReconciler{
    Client:   mgr.GetClient(),
    Log:      ctrl.Log.WithName("controllers").WithName("DatabaseCluster"),
    Scheme:   mgr.GetScheme(),
    Recorder: mgr.GetEventRecorderFor("databasecluster-controller"),
    DBEngine: &controllers.PostgresEngine{},
}
```
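This injection point is also what keeps unit tests fast: a test can swap in an in-memory fake instead of a real database. A minimal, hypothetical sketch:

```go
// FakeEngine is an in-memory DatabaseEngine used only in tests (hypothetical).
type FakeEngine struct {
    Databases []string
    Users     []string
}

func (f *FakeEngine) CreateDatabase(ctx context.Context, name string) error {
    f.Databases = append(f.Databases, name)
    return nil
}

func (f *FakeEngine) CreateUser(ctx context.Context, username, password string) error {
    f.Users = append(f.Users, username)
    return nil
}

func (f *FakeEngine) GrantPermissions(ctx context.Context, username, database string, permissions []string) error {
    return nil // record or assert as needed
}
```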
I approach operator testing comprehensively:
I unit test all core logic using fake clients and mocks:
```go
func TestReconcileConfigMap(t *testing.T) {
    // Setup
    dbc := &dbv1alpha1.DatabaseCluster{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-db",
            Namespace: "default",
        },
        Spec: dbv1alpha1.DatabaseClusterSpec{
            Version:  "13.3",
            Replicas: 3,
        },
    }

    // Create a fake client
    objs := []client.Object{dbc}
    scheme := runtime.NewScheme()
    dbv1alpha1.AddToScheme(scheme)
    corev1.AddToScheme(scheme)
    fakeClient := fake.NewClientBuilder().WithScheme(scheme).WithObjects(objs...).Build()

    reconciler := &DatabaseClusterReconciler{
        Client:   fakeClient,
        Scheme:   scheme,
        Recorder: record.NewFakeRecorder(10),
    }

    // Execute
    err := reconciler.reconcileConfigMap(context.Background(), dbc)

    // Verify
    assert.NoError(t, err)

    // Check if ConfigMap was created
    configMap := &corev1.ConfigMap{}
    err = fakeClient.Get(context.Background(),
        types.NamespacedName{Name: "test-db-config", Namespace: "default"},
        configMap,
    )

    assert.NoError(t, err)
    assert.Equal(t, "test-db-config", configMap.Name)
    assert.Contains(t, configMap.Data, "db.conf")
}
```
I use the Kubernetes controller-runtime envtest package for integration testing:
```go
var (
    cfg       *rest.Config
    k8sClient client.Client
    testEnv   *envtest.Environment
    ctx       context.Context
    cancel    context.CancelFunc
)

func TestAPIs(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
    ctx, cancel = context.WithCancel(context.TODO())

    By("bootstrapping test environment")
    testEnv = &envtest.Environment{
        CRDDirectoryPaths:     []string{filepath.Join("..", "config", "crd", "bases")},
        ErrorIfCRDPathMissing: true,
    }

    var err error
    cfg, err = testEnv.Start()
    Expect(err).NotTo(HaveOccurred())
    Expect(cfg).NotTo(BeNil())

    err = dbv1alpha1.AddToScheme(scheme.Scheme)
    Expect(err).NotTo(HaveOccurred())

    k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())
    Expect(k8sClient).NotTo(BeNil())
})

var _ = AfterSuite(func() {
    cancel()
    By("tearing down the test environment")
    err := testEnv.Stop()
    Expect(err).NotTo(HaveOccurred())
})

var _ = Describe("DatabaseCluster controller", func() {
    Context("When creating a DatabaseCluster", func() {
        It("Should create associated resources", func() {
            dbc := &dbv1alpha1.DatabaseCluster{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "test-db",
                    Namespace: "default",
                },
                Spec: dbv1alpha1.DatabaseClusterSpec{
                    Version:  "13.3",
                    Replicas: 3,
                },
            }
            Expect(k8sClient.Create(ctx, dbc)).Should(Succeed())

            // Wait for resources to be created
            configMap := &corev1.ConfigMap{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name: "test-db-config", Namespace: "default",
                }, configMap)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            // Verify statefulset was created
            sts := &appsv1.StatefulSet{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name: "test-db", Namespace: "default",
                }, sts)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(sts.Spec.Replicas).To(Equal(pointer.Int32Ptr(3)))
        })
    })
})
```
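The Eventually assertions rely on polling constants defined elsewhere in the suite; typical values (assumed here) look like this:

```go
const (
    timeout  = time.Second * 30
    interval = time.Millisecond * 250
)
```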
I create actual Kubernetes clusters for end-to-end testing:
```go
func TestOperatorEndToEnd(t *testing.T) {
    // Skip if not running in CI environment with K8s access
    if os.Getenv("RUN_E2E_TESTS") != "true" {
        t.Skip("Skipping E2E tests")
    }

    // Use the current context in kubeconfig
    config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        t.Fatalf("Error building kubeconfig: %v", err)
    }

    // Create a clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        t.Fatalf("Error creating clientset: %v", err)
    }

    // Create dynamic client for CRDs
    dynamicClient, err := dynamic.NewForConfig(config)
    if err != nil {
        t.Fatalf("Error creating dynamic client: %v", err)
    }

    // Create test namespace
    namespace := &corev1.Namespace{
        ObjectMeta: metav1.ObjectMeta{
            Name: "operator-e2e-test",
        },
    }
    _, err = clientset.CoreV1().Namespaces().Create(context.TODO(), namespace, metav1.CreateOptions{})
    if err != nil && !errors.IsAlreadyExists(err) {
        t.Fatalf("Error creating namespace: %v", err)
    }

    // Run the actual tests...
    t.Run("CreateDatabaseCluster", func(t *testing.T) {
        // Test creating a DatabaseCluster and verify it works
    })

    // Clean up
    err = clientset.CoreV1().Namespaces().Delete(context.TODO(), namespace.Name, metav1.DeleteOptions{})
    if err != nil {
        t.Fatalf("Error deleting namespace: %v", err)
    }
}
```
For production-critical operators, I also implement chaos testing: killing the operator’s pods, disrupting its access to the API server, and failing nodes, then confirming that reconciliation resumes without manual intervention.
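As a rough illustration of the pod-kill case, a test along these lines can piggyback on the E2E harness above; the namespace and label selector are assumptions for the sketch:

```go
func TestOperatorSurvivesPodDeletion(t *testing.T) {
    if os.Getenv("RUN_CHAOS_TESTS") != "true" {
        t.Skip("Skipping chaos tests")
    }

    config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        t.Fatalf("Error building kubeconfig: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        t.Fatalf("Error creating clientset: %v", err)
    }

    // Delete the operator's pods and rely on its Deployment to replace them.
    err = clientset.CoreV1().Pods("database-operator-system").DeleteCollection(
        context.TODO(),
        metav1.DeleteOptions{},
        metav1.ListOptions{LabelSelector: "app.kubernetes.io/name=database-operator"},
    )
    if err != nil {
        t.Fatalf("Error deleting operator pods: %v", err)
    }

    // After the restart, previously created DatabaseCluster resources should
    // still converge - poll their status until they report Ready again.
}
```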
Deploying and monitoring operators requires careful planning:
I use a progressive deployment approach, promoting the operator through development and staging clusters before it reaches production.
Example Helm chart values for production:
```yaml
# Production values for database-operator
replicaCount: 2
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 256Mi
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - database-operator
        topologyKey: "kubernetes.io/hostname"
rbac:
  create: true
  clusterRole: true
```
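Note that replicaCount: 2 only makes sense with leader election enabled, so that exactly one replica actively reconciles while the other stands by. A sketch of enabling it in main.go; the election ID is an assumption, and scheme is the usual Kubebuilder-scaffolded runtime scheme:

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme:           scheme,
    LeaderElection:   true,
    LeaderElectionID: "databasecluster-operator.database.example.com", // any stable, unique ID works
})
if err != nil {
    // handle startup failure
    os.Exit(1)
}
```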
I implement comprehensive monitoring for operators:
```go
// Define metrics
var (
    reconcileTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "controller_reconcile_total",
            Help: "The total number of reconciliations",
        },
        []string{"controller", "result"},
    )
    reconcileDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "controller_reconcile_duration_seconds",
            Help:    "The duration of reconciliations",
            Buckets: prometheus.DefBuckets,
        },
        []string{"controller"},
    )
    resourcesTotal = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "controller_resources_total",
            Help: "The total number of resources by state",
        },
        []string{"controller", "kind", "state"},
    )
)

// Instrument the reconciliation loop
func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    controllerName := "databasecluster"
    timer := prometheus.NewTimer(reconcileDuration.WithLabelValues(controllerName))
    defer timer.ObserveDuration()

    // Reconciliation logic...

    if err != nil {
        reconcileTotal.WithLabelValues(controllerName, "error").Inc()
        return ctrl.Result{}, err
    }

    reconcileTotal.WithLabelValues(controllerName, "success").Inc()
    return ctrl.Result{}, nil
}

// Set up structured logging using controller-runtime's zap wrapper
func setupLogger() logr.Logger {
    var opts zap.Options
    opts.Development = false
    opts.EncoderConfigOptions = append(
        opts.EncoderConfigOptions,
        func(c *zapcore.EncoderConfig) {
            c.TimeKey = "timestamp"
            c.EncodeTime = zapcore.ISO8601TimeEncoder
        },
    )
    // zap.New already returns a logr.Logger, so no extra zapr wrapping is needed
    return zap.New(zap.UseFlagOptions(&opts))
}
```
I pair these metrics with Prometheus alerting rules that catch the failure modes I care most about:

```yaml
groups:
  - name: DatabaseOperatorAlerts
    rules:
      - alert: DatabaseOperatorHighErrorRate
        expr: rate(controller_reconcile_total{controller="databasecluster",result="error"}[5m]) > 0.1
        for: 10m
        annotations:
          summary: "Database operator high error rate"
          description: "Database operator has a high error rate for the last 10 minutes"
      - alert: DatabaseOperatorReconciliationStalled
        expr: count(time() - on(namespace, name) max_over_time(controller_resource_last_reconcile_time{controller="databasecluster"}[1h]) > 3600) > 0
        for: 5m
        annotations:
          summary: "Database reconciliation stalled"
          description: "Some database resources haven't been reconciled in over an hour"
```
To illustrate these principles, I’ll share a real-world operator I developed for a financial services client:
The organization needed to enforce consistent configuration policies across 40+ Kubernetes clusters spanning on-premise and multiple cloud providers. Manual configuration was error-prone and time-consuming.
I designed and built a ConfigSync operator that pulled configuration from central, Git-backed config stores, validated it, and applied it consistently to every cluster, either reporting or automatically remediating drift.
The operator consisted of:
Central ConfigStore CRD:
```yaml
apiVersion: config.example.com/v1alpha1
kind: ConfigStore
metadata:
  name: global-configs
spec:
  source:
    git:
      repository: https://github.com/company/k8s-configs
      branch: main
      path: /global
  schedule: "*/30 * * * *"
  validation:
    mode: Strict  # Strict, Warn, or None
```
ConfigSync CRD:
```yaml
apiVersion: config.example.com/v1alpha1
kind: ConfigSync
metadata:
  name: cluster-config-sync
spec:
  sources:
    - name: global
      configStore: global-configs
    - name: environment
      configStore: prod-configs
    - name: regional
      configStore: us-east-configs
  targets:
    - kind: NetworkPolicy
      apiGroup: networking.k8s.io
      namespaces:
        - "*"
    - kind: ResourceQuota
      apiGroup: ""
      namespaces:
        - default
        - kube-system
  remediation:
    mode: Apply  # Apply, Report, or None
```
Controller Implementation:
GitOps Integration:
The operator transformed the organization’s configuration management, replacing error-prone manual changes with automated, auditable synchronization across all clusters.
Building Kubernetes operators is both an art and a science. The most successful operators strike a balance between solving real operational challenges and maintaining simplicity and reliability.
As you approach your own operator development, remember to start from a clear domain model, keep reconciliation idempotent, and invest early in testing and observability.
By following the approaches outlined in this playbook, you’ll be well-equipped to build operators that not only solve real problems but do so reliably and maintainably.
For further discussion or questions about operator development, feel free to reach out. I’m always interested in hearing about interesting operator use cases and implementation challenges.