Running Robotics Simulation Workloads on GKE Autopilot
Why MuJoCo and Gazebo Belong on Autopilot
Robotics teams increasingly pair reinforcement learning pipelines with high-fidelity simulators such as MuJoCo and Gazebo. These toolchains generate massive amounts of physics data while coordinating GPU-accelerated policy training. Running them on GKE Autopilot eliminates the undifferentiated heavy lifting of cluster capacity management while still giving access to NVIDIA GPUs, spot pricing, and Google Cloud’s observability stack.
Autopilot’s pay-per-pod billing and vertical pod autoscaling are especially attractive for simulation workloads whose utilization is spiky by design. When configured carefully, you can process thousands of rollouts per hour and feed the results straight into distributed learners without managing node pools yourself.
Reference Architecture
The design below has powered my most recent robotics engagement, where a fleet of MuJoCo and Gazebo jobs drove reinforcement learning agents serving in production:
- Autopilot Cluster (Regional)
  - Regional Autopilot cluster with the managed service mesh feature enabled for consistent east-west mTLS.
  - GPU workloads scheduled through the cloud.google.com/gke-accelerator node selector, with a small minimum replica count kept warm for steady capacity.
- Simulation Workloads
  - MuJoCo pods packaged as lightweight Python images with GLFW disabled and EGL for headless rendering.
  - Gazebo Classic and Gazebo Garden images split into physics-only and sensor-only replicas to improve pod bin-packing.
- Policy Training & Storage
  - Ray or Vertex AI custom jobs consuming simulator outputs via GCS and Pub/Sub.
  - Artifacts and textures stored in Artifact Registry with gzip-compressed layers to minimize pull latency.
- Observability & Governance
  - Cloud Logging + Ops Agent sidecars for simulator metrics (step time, physics FPS, collision counts).
  - Fleet Policy Controller enforcing the Pod Security Admission restricted profile.
Outcome: 20k+ parallel simulation steps, sub-minute environment rebuild times, and consistent GPU saturation at 82%+.
Cluster Setup
Create a regional, private Autopilot cluster on the rapid release channel. Note that service mesh is enabled separately through the cluster's fleet membership, GPU access is requested per pod rather than with a cluster flag, and Managed Service for Prometheus is on by default in Autopilot:

gcloud container clusters create-auto robotics-autopilot \
  --region us-central1 \
  --release-channel rapid \
  --enable-private-nodes \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM

Autopilot automatically provisions node resources per pod; you only request what the simulator needs.
Packaging MuJoCo for Headless Rendering
MuJoCo runs efficiently in headless mode with EGL:
# Dockerfile.mujoco
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
ENV MUJOCO_VERSION=3.1.2 \
MUJOCO_GL=egl \
MUJOCO_RENDER=offscreen
RUN apt-get update && apt-get install -y \
libgl1-mesa-dev libglu1-mesa libosmesa6 wget unzip python3 python3-pip \
&& rm -rf /var/lib/apt/lists/*
RUN wget https://github.com/google-deepmind/mujoco/releases/download/${MUJOCO_VERSION}/mujoco-${MUJOCO_VERSION}-linux-x86_64.tar.gz -O /tmp/mujoco.tar.gz \
&& tar -xzf /tmp/mujoco.tar.gz -C /opt \
&& ln -s /opt/mujoco-${MUJOCO_VERSION} /opt/mujoco \
&& rm /tmp/mujoco.tar.gz
COPY requirements.txt /tmp/
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
COPY . /workspace
WORKDIR /workspace
CMD ["python3", "rollout_worker.py"]

Pod manifest requesting a GPU for the rollout worker:
apiVersion: v1
kind: Pod
metadata:
  name: mujoco-rollout-0
  labels:
    workload-type: mujoco-rollout
spec:
  nodeSelector:
    # Autopilot provisions a node with the requested accelerator
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: worker
    image: us-central1-docker.pkg.dev/robotics-project/sim/mujoco:3.1.2
    resources:
      requests:
        cpu: "2000m"
        memory: 6Gi
        ephemeral-storage: 10Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "4000m"
        memory: 12Gi
        nvidia.com/gpu: 1
    env:
    - name: MUJOCO_GL
      value: "egl"
    - name: ROLLOUTS_PER_BATCH
      value: "256"

Autopilot handles pod-to-node placement and provisions GPU capacity transparently as pods are scheduled; you never touch node pools.
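The rollout worker itself can stay simple. Below is a minimal, hypothetical sketch of the batching loop in rollout_worker.py; the step_environment stub and its fields are assumptions, and a real worker would call mujoco.mj_step on a loaded model instead:

```python
import json
import os


def step_environment(env_id: int, step: int) -> dict:
    """Stub for one simulator step; a real worker would advance a
    MuJoCo MjData here (mujoco.mj_step) and read out observations."""
    return {"env_id": env_id, "step": step, "reward": 0.0}


def run_batch(batch_size: int, num_envs: int = 4) -> list:
    """Collect one batch of transitions across parallel environments."""
    transitions = []
    steps_per_env = batch_size // num_envs
    for env_id in range(num_envs):
        for step in range(steps_per_env):
            transitions.append(step_environment(env_id, step))
    return transitions


if __name__ == "__main__":
    # Matches the env vars set in the pod manifest.
    assert os.environ.get("MUJOCO_GL", "egl") == "egl"
    batch_size = int(os.environ.get("ROLLOUTS_PER_BATCH", "256"))
    batch = run_batch(batch_size)
    print(json.dumps({"collected": len(batch)}))
```

With ROLLOUTS_PER_BATCH=256 and four environments, each environment contributes 64 steps per batch before the worker publishes results downstream.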
Optimizing Gazebo for Throughput
Gazebo simulations often involve heavier CPU footprints. Separate physics and sensor workloads to maximize bin packing:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gazebo-physics
spec:
  replicas: 6
  selector:
    matchLabels:
      app: gazebo-physics
  template:
    metadata:
      labels:
        app: gazebo-physics
    spec:
      containers:
      - name: physics
        image: us-central1-docker.pkg.dev/robotics-project/sim/gazebo-physics:latest
        resources:
          requests:
            cpu: "6000m"
            memory: 8Gi
          limits:
            cpu: "8000m"
            memory: 10Gi
        env:
        - name: GAZEBO_MASTER_URI
          value: "http://gazebo-master:11345"
        - name: PHYSICS_STEP_HZ
          value: "1000"

Sensor-processing pods can remain CPU-only, using pod affinity to co-locate with physics pods for shared memory volumes. Autopilot enforces the requested CPU limits, so usage stays bounded without manual node tuning.
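That co-location can be expressed with pod affinity. A sketch of the pod template fragment for a hypothetical gazebo-sensors Deployment (the deployment name and the idea of a shared volume are assumptions; Autopilot places some restrictions on affinity topology keys):

```yaml
# Pod template fragment for a hypothetical gazebo-sensors Deployment.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: gazebo-physics   # co-locate with the physics pods above
        topologyKey: kubernetes.io/hostname
```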
Efficient Pipeline Coordination
- Event Streaming – Each simulator publishes rollouts
to Pub/Sub topics with ordering keys per environment.
- Batching & Storage – Dataflow jobs consolidate
episodes into Parquet and write to Cloud Storage for downstream
learners.
- Policy Training – Vertex AI custom training or Ray
clusters running on dedicated Autopilot GPU deployments consume the
aggregated data.
- Model Registry – Trained policies land in Vertex Model Registry, then deploy to KServe GPU inference services (documented in Article 7).
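The per-environment ordering keys in the Event Streaming step are what keep downstream batching deterministic. A stdlib-only sketch of that grouping logic (message field names are assumptions; in the real pipeline Dataflow performs this consolidation before writing Parquet):

```python
from collections import defaultdict


def group_by_ordering_key(messages: list) -> dict:
    """Group rollout messages by per-environment ordering key,
    preserving arrival order within each key -- the same guarantee
    Pub/Sub ordering keys give an ordered consumer."""
    episodes = defaultdict(list)
    for msg in messages:
        episodes[msg["ordering_key"]].append(msg)
    return dict(episodes)


# Interleaved messages from two simulated environments.
messages = [
    {"ordering_key": "env-0", "step": 0, "reward": 0.1},
    {"ordering_key": "env-1", "step": 0, "reward": 0.0},
    {"ordering_key": "env-0", "step": 1, "reward": 0.3},
]
episodes = group_by_ordering_key(messages)
```

Each value in `episodes` is now a step-ordered slice of a single environment's trajectory, ready to be consolidated into one Parquet row group per episode.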
This design let the team iterate on reward shaping twice per day while keeping infrastructure spend predictable.
Cost and Performance Guardrails
Use Spot Pods for Non-critical Rollouts
Autopilot supportsautopilot.gke.io/spot: "true"annotation at the pod level. Pair with checkpointing every N rollouts.Leverage Horizontal Pod Autoscaler
Autopilot integrates with HPA. Base scaling on custom metrics such as step latency or queue depth.Enable Managed Prometheus
Use GPU/DCGM exporters and custom MuJoCo metrics to alert on underutilized instances (<60% GPU busy).Plan for GPU Warm Pools
Cold-starting Autopilot GPU nodes adds ~4 minutes. Keep a small baseline replica count alive during working hours.Sandbox Untrusted Assets
Enforce Pod Security Admissionrestrictedand sign images with Binary Authorization to protect against compromised Gazebo assets.
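Checkpointing every N rollouts is what makes Spot Pods safe for this workload. A minimal sketch of a resumable worker (file layout and function names are assumptions; a production worker would also trap SIGTERM to flush state on preemption notice):

```python
import json
import os

CHECKPOINT_EVERY = 100  # "every N rollouts" from the guardrail above


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so a Spot preemption mid-write
    never leaves a corrupt file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_rollouts(total: int, path: str) -> dict:
    """Run rollouts, resuming from the last checkpoint if one exists."""
    state = {"completed": 0}
    if os.path.exists(path):  # pod was preempted and rescheduled
        with open(path) as f:
            state = json.load(f)
    for i in range(state["completed"], total):
        # ... run one rollout here ...
        state["completed"] = i + 1
        if state["completed"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

On restart after preemption, the worker skips already-completed rollouts and loses at most CHECKPOINT_EVERY - 1 rollouts of work.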
Key Takeaways
- Autopilot’s pay-per-pod model aligns well with bursty simulation
workloads.
- MuJoCo benefits from GPU scheduling and headless rendering, while
Gazebo thrives once physics and sensor pipelines are decoupled.
- Combining Pub/Sub, Dataflow, and Vertex AI creates a high-throughput
loop from simulation to learner without manual node management.
- Managed observability (Cloud Logging, Managed Prometheus, Ops Agent) keeps rollouts auditable and debuggable at scale.
Add this architecture to your ML platform, and you can continuously train robotics policies while Autopilot handles the infrastructure curveballs.