Jagadesh - Kubernetes Expert

Running Robotics Simulation Workloads on GKE Autopilot

Why MuJoCo and Gazebo Belong on Autopilot

Robotics teams increasingly pair reinforcement learning pipelines with high-fidelity simulators such as MuJoCo and Gazebo. These toolchains generate massive amounts of physics data while coordinating GPU-accelerated policy training. Running them on GKE Autopilot eliminates the undifferentiated heavy lifting of cluster capacity management while still giving access to NVIDIA GPUs, spot pricing, and Google Cloud’s observability stack.

Autopilot’s pay-per-pod billing and vertical pod autoscaling are especially attractive for simulation workloads whose utilization is spiky by design. When configured carefully, you can process thousands of rollouts per hour and feed the results straight into distributed learners without managing node pools yourself.

Reference Architecture

The design below has powered my most recent robotics engagement, where a fleet of MuJoCo and Gazebo jobs trained the reinforcement learning agents now serving in production:

  1. Autopilot Cluster (Regional)
    • Regional Autopilot cluster with the mesh feature enabled for consistent east-west mTLS.
    • GPU pods request nvidia.com/gpu and pin an accelerator model through the cloud.google.com/gke-accelerator node selector, with a small floor of always-on replicas for steady warm capacity.
  2. Simulation Workloads
    • MuJoCo pods packaged as lightweight Python images with GLFW disabled and EGL for headless rendering.
    • Gazebo Classic and Gazebo Garden images split into physics-only and sensor-only replicas to improve pod bin-packing.
  3. Policy Training & Storage
    • Ray or Vertex AI custom jobs consuming simulator outputs via GCS and Pub/Sub.
    • Artifacts and textures stored in Artifact Registry with gzip layers to minimize pull latency.
  4. Observability & Governance
    • Cloud Logging + Ops Agent sidecars for simulator metrics (step time, physics FPS, collision counts), as sketched below.
    • Fleet-level Policy Controller enforcing the restricted Pod Security Standards policy bundle.

Outcome: 20k+ parallel simulation steps, sub-minute environment rebuild times, and consistent GPU saturation at 82%+.
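
To make the metrics in item 4 concrete, here is a minimal sketch of how a simulator worker can emit step time, physics FPS, and collision counts as structured JSON on stdout, which GKE's logging agent ingests into Cloud Logging as structured entries. The field names, message label, and reporting cadence are assumptions for illustration, not part of the reference deployment.

# metrics_logging.py - structured-log sketch; field names and cadence are assumptions.
import json
import time


def log_sim_metrics(env_id, step_time_s, physics_fps, collision_count):
    """Print one JSON line to stdout; GKE's logging agent turns it into a
    structured Cloud Logging entry with these fields in jsonPayload."""
    print(json.dumps({
        "severity": "INFO",
        "message": "simulator_metrics",
        "env_id": env_id,
        "step_time_ms": round(step_time_s * 1000, 3),
        "physics_fps": round(physics_fps, 1),
        "collision_count": collision_count,
    }), flush=True)


# Usage inside a rollout loop: time one physics step and report it.
start = time.perf_counter()
# ... a single simulator step would run here ...
step_time = time.perf_counter() - start
log_sim_metrics("env-0", step_time, physics_fps=1.0 / max(step_time, 1e-9), collision_count=0)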

Cluster Setup

Create a regional Autopilot cluster. GPU capacity is requested per pod, so no cluster-level GPU flag is needed, and the managed service mesh is enabled as a fleet feature rather than through a cluster flag:

gcloud container clusters create-auto robotics-autopilot \
  --region us-central1 \
  --release-channel rapid \
  --enable-private-nodes \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM

# Managed Prometheus collection is enabled by default on Autopilot clusters.
# The managed service mesh is a fleet feature, enabled once the cluster is
# registered to the fleet (see the Cloud Service Mesh docs for registration):
gcloud container fleet mesh enable

Autopilot automatically provisions node resources per pod; you only request what the simulator needs.

Packaging MuJoCo for Headless Rendering

MuJoCo runs efficiently in headless mode with EGL:

# Dockerfile.mujoco
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
# MUJOCO_GL=egl selects headless EGL rendering in the MuJoCo Python bindings;
# NVIDIA_DRIVER_CAPABILITIES=all exposes the graphics capability EGL needs when
# the image runs under the NVIDIA container runtime.
ENV MUJOCO_VERSION=3.1.2 \
    MUJOCO_GL=egl \
    NVIDIA_DRIVER_CAPABILITIES=all

RUN apt-get update && apt-get install -y \
    libgl1-mesa-dev libglu1-mesa libosmesa6 libegl1 wget python3 python3-pip \
 && rm -rf /var/lib/apt/lists/*

# MuJoCo 3.x release tarballs are hosted on GitHub; the pip "mujoco" wheel already bundles
# the native library, so this tarball is only needed for non-Python tooling in the image.
RUN wget https://github.com/google-deepmind/mujoco/releases/download/${MUJOCO_VERSION}/mujoco-${MUJOCO_VERSION}-linux-x86_64.tar.gz \
      -O /tmp/mujoco.tar.gz \
 && tar -xzf /tmp/mujoco.tar.gz -C /opt \
 && ln -s /opt/mujoco-${MUJOCO_VERSION} /opt/mujoco \
 && rm /tmp/mujoco.tar.gz

COPY requirements.txt /tmp/
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
COPY . /workspace
WORKDIR /workspace
CMD ["python3", "rollout_worker.py"]

Pod manifest requesting a GPU through the Autopilot accelerator node selector:

apiVersion: v1
kind: Pod
metadata:
  name: mujoco-rollout-0
  labels:
    workload-type: mujoco-rollout
spec:
  nodeSelector:
    # Autopilot provisions GPU nodes from this selector plus the nvidia.com/gpu request below.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # T4 shown as an example; pick your model
  containers:
  - name: worker
    image: us-central1-docker.pkg.dev/robotics-project/sim/mujoco:3.1.2
    resources:
      requests:
        cpu: "2000m"
        memory: 6Gi
        ephemeral-storage: 10Gi
        nvidia.com/gpu: 1
      limits:
        # Autopilot sets limits equal to requests, so declare them to match.
        cpu: "2000m"
        memory: 6Gi
        nvidia.com/gpu: 1
    env:
    - name: MUJOCO_GL
      value: "egl"
    - name: ROLLOUTS_PER_BATCH
      value: "256"

Autopilot handles pod-to-node placement and provisions GPU capacity transparently. Because requests drive both scheduling and billing, watch for sustained CPU throttling in rollout workers and raise the CPU request rather than counting on limit headroom.

Optimizing Gazebo for Throughput

Gazebo simulations often involve heavier CPU footprints. Separate physics and sensor workloads to maximize bin packing:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gazebo-physics
spec:
  replicas: 6
  selector:
    matchLabels:
      app: gazebo-physics
  template:
    metadata:
      labels:
        app: gazebo-physics
    spec:
      containers:
      - name: physics
        image: us-central1-docker.pkg.dev/robotics-project/sim/gazebo-physics:latest
        resources:
          requests:
            cpu: "6000m"
            memory: 8Gi
          limits:
            # Autopilot sets limits equal to requests, so declare them to match.
            cpu: "6000m"
            memory: 8Gi
        env:
        - name: GAZEBO_MASTER_URI
          value: "http://gazebo-master:11345"
        - name: PHYSICS_STEP_HZ
          value: "1000"

Sensor-processing pods can remain CPU-only and use pod affinity to co-locate with the physics pods whose shared-memory volumes they read. Vertical Pod Autoscaler recommendations keep CPU requests right-sized over time without manual tuning.

Efficient Pipeline Coordination

  1. Event Streaming – Each simulator publishes rollouts to Pub/Sub topics with ordering keys per environment (see the publisher sketch after this list).
  2. Batching & Storage – Dataflow jobs consolidate episodes into Parquet and write to Cloud Storage for downstream learners.
  3. Policy Training – Vertex AI custom training or Ray clusters running on dedicated Autopilot GPU deployments consume the aggregated data.
  4. Model Registry – Trained policies land in Vertex Model Registry, then deploy to KServe GPU inference services (documented in Article 7).
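
As a concrete illustration of step 1, the sketch below publishes a serialized episode to a Pub/Sub topic with an ordering key per environment, using the google-cloud-pubsub client. The project ID, topic name, and payload shape are assumptions for illustration.

# publish_rollout.py - sketch of step 1; project, topic, and payload shape are assumptions.
import json

from google.cloud import pubsub_v1

PROJECT_ID = "robotics-project"   # reuses the project name from the image paths above
TOPIC_ID = "sim-rollouts"         # hypothetical topic

# Ordering keys require message ordering on the publisher; Google also recommends
# a regional endpoint in the same region as the cluster for ordered publishing.
publisher = pubsub_v1.PublisherClient(
    client_options={"api_endpoint": "us-central1-pubsub.googleapis.com:443"},
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True),
)
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def publish_rollout(env_id, episode):
    """Publish one episode; the ordering key keeps per-environment episodes in sequence."""
    future = publisher.publish(
        topic_path, json.dumps(episode).encode("utf-8"), ordering_key=env_id
    )
    future.result()  # block until the broker acknowledges this message


publish_rollout("env-0", {"steps": 1000, "mean_reward": 0.0})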

This design let the team iterate on reward shaping twice per day while keeping infrastructure spend predictable.

Cost and Performance Guardrails

Key Takeaways

Add this architecture to your ML platform, and you can continuously train robotics policies while Autopilot handles the infrastructure curveballs.