Robotics teams increasingly pair reinforcement learning pipelines with high-fidelity simulators such as MuJoCo and Gazebo. These toolchains generate massive amounts of physics data while coordinating GPU-accelerated policy training. Running them on GKE Autopilot eliminates the undifferentiated heavy lifting of cluster capacity management while still giving access to NVIDIA GPUs, spot pricing, and Google Cloud’s observability stack.
Autopilot’s pay-per-pod billing and vertical pod autoscaling are especially attractive for simulation workloads whose utilization is spiky by design. When configured carefully, you can process thousands of rollouts per hour and feed the results straight into distributed learners without managing node pools yourself.
The design below has powered my most recent robotics engagement, where a fleet of MuJoCo and Gazebo jobs drove reinforcement learning agents serving in production:
- Managed service mesh feature enabled for consistent east-west mTLS.
- workload-class: GKE_GPU and autoscaling.knative.dev/minScale for steady warm capacity.
- gzip-compressed image layers to minimize pull latency.
- Pod Security Admission restricted profile.
- Outcome: 20k+ parallel simulation steps, sub-minute environment rebuild times, and consistent GPU saturation at 82%+.
Create a regional Autopilot cluster with optional GPU workload class and managed service mesh:
gcloud container clusters create-auto robotics-autopilot \
--region us-central1 \
--release-channel rapid \
--workload-policies allow-gpus \
--enable-private-nodes \
--enable-managed-prometheus \
--enable-mesh \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM
Autopilot automatically provisions node resources per pod; you only request what the simulator needs.
MuJoCo runs efficiently in headless mode with EGL:
# Dockerfile.mujoco
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
ENV MUJOCO_VERSION=3.1.2 \
    MUJOCO_GL=egl \
    MUJOCO_RENDER=offscreen
RUN apt-get update && apt-get install -y \
    libgl1-mesa-dev libglu1-mesa libosmesa6 wget unzip python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN wget https://mujoco.org/download/mujoco-${MUJOCO_VERSION}-linux-x86_64.tar.gz -O /tmp/mujoco.tar.gz \
&& tar -xzf /tmp/mujoco.tar.gz -C /opt \
&& ln -s /opt/mujoco-${MUJOCO_VERSION} /opt/mujoco \
&& rm /tmp/mujoco.tar.gz
COPY requirements.txt /tmp/
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
COPY . /workspace
WORKDIR /workspace
CMD ["python3", "rollout_worker.py"]
Pod manifest requesting GPUs and specifying Autopilot-specific annotations:
apiVersion: v1
kind: Pod
metadata:
  name: mujoco-rollout-0
  labels:
    workload-type: mujoco-rollout
  annotations:
    autopilot.gke.io/workload-class: "gke-gpu"
spec:
  containers:
  - name: worker
    image: us-central1-docker.pkg.dev/robotics-project/sim/mujoco:3.1.2
    resources:
      requests:
        cpu: "2000m"
        memory: 6Gi
        ephemeral-storage: 10Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "4000m"
        memory: 12Gi
        nvidia.com/gpu: 1
    env:
    - name: MUJOCO_GL
      value: "egl"
    - name: ROLLOUTS_PER_BATCH
      value: "256"
Autopilot handles pod-to-node placement and scales GPU capacity transparently. The resource-adjustment hint prefers upward adjustments when VPA detects sustained throttling.
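The exact annotation behind that hint isn’t reproduced above; one way to approximate the behavior with standard APIs is a VerticalPodAutoscaler whose resource policy floors recommendations at the baseline request, so adjustments only move upward. A sketch, assuming the rollout workers are wrapped in a Deployment named mujoco-rollout (hypothetical):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mujoco-rollout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mujoco-rollout          # hypothetical Deployment wrapping the rollout workers
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: "2000m"              # never recommend below the baseline request
        memory: 6Gi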
Gazebo simulations often involve heavier CPU footprints. Separate physics and sensor workloads to maximize bin packing:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gazebo-physics
spec:
  replicas: 6
  selector:
    matchLabels:
      app: gazebo-physics
  template:
    metadata:
      labels:
        app: gazebo-physics
    spec:
      containers:
      - name: physics
        image: us-central1-docker.pkg.dev/robotics-project/sim/gazebo-physics:latest
        resources:
          requests:
            cpu: "6000m"
            memory: 8Gi
          limits:
            cpu: "8000m"
            memory: 10Gi
        env:
        - name: GAZEBO_MASTER_URI
          value: "http://gazebo-master:11345"
        - name: PHYSICS_STEP_HZ
          value: "1000"
Sensor-processing pods can remain CPU-only, using pod affinity to co-locate with the physics pods and share memory volumes. Autopilot’s vertical pod autoscaling keeps CPU usage within limits without manual tuning.
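A minimal sketch of that affinity rule, assuming a sensor-processing Deployment named gazebo-sensors with a matching image path (both hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gazebo-sensors            # hypothetical sensor-processing Deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: gazebo-sensors
  template:
    metadata:
      labels:
        app: gazebo-sensors
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: gazebo-physics    # co-locate with the physics pods
            topologyKey: kubernetes.io/hostname
      containers:
      - name: sensors
        image: us-central1-docker.pkg.dev/robotics-project/sim/gazebo-sensors:latest   # hypothetical image
        resources:
          requests:
            cpu: "2000m"
            memory: 4Gi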
This design let the team iterate on reward shaping twice per day while keeping infrastructure spend predictable.
Use Spot Pods for Non-critical Rollouts
Autopilot schedules non-critical rollouts onto Spot capacity when you add the cloud.google.com/gke-spot: "true" node selector at the pod level. Pair it with checkpointing every N rollouts so a preemption only costs the current batch.
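A minimal fragment of a rollout Pod spec that opts into Spot capacity:

spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  containers:
  - name: worker
    image: us-central1-docker.pkg.dev/robotics-project/sim/mujoco:3.1.2
    # On SIGTERM (preemption notice), flush the latest rollout checkpoint before exiting.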
Leverage Horizontal Pod Autoscaler
Autopilot integrates with HPA. Base scaling on custom metrics such as step latency or queue depth.
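A minimal sketch, assuming the rollout workers are managed by a Deployment named mujoco-rollout and export a rollout_queue_depth metric through the custom metrics adapter (both names hypothetical):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mujoco-rollout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mujoco-rollout          # hypothetical Deployment wrapping the rollout workers
  minReplicas: 4
  maxReplicas: 64
  metrics:
  - type: Pods
    pods:
      metric:
        name: rollout_queue_depth # hypothetical per-pod custom metric
      target:
        type: AverageValue
        averageValue: "32"        # scale out when the average backlog per worker exceeds 32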
Enable Managed Prometheus
Use GPU/DCGM exporters and custom MuJoCo metrics to alert on underutilized instances (<60% GPU busy).
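A minimal Managed Prometheus scrape config, assuming the rollout containers expose their custom metrics on a port named metrics (hypothetical):

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: mujoco-rollout-metrics
spec:
  selector:
    matchLabels:
      workload-type: mujoco-rollout
  endpoints:
  - port: metrics        # hypothetical metrics port on the rollout containers
    interval: 30s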
Plan for GPU Warm Pools
Cold-starting Autopilot GPU nodes adds ~4 minutes. Keep a small baseline replica count alive during working hours.
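One way to keep that baseline warm is a low-priority placeholder Deployment that holds GPU capacity which real rollout pods can preempt; a sketch under that assumption, with all names hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-placeholder           # hypothetical low-priority class for placeholder pods
value: -10
preemptionPolicy: Never
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warm-pool             # hypothetical placeholder Deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-warm-pool
  template:
    metadata:
      labels:
        app: gpu-warm-pool
    spec:
      priorityClassName: gpu-placeholder
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # match the accelerator used by the rollout pods
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1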
Sandbox Untrusted Assets
Enforce Pod Security Admission restricted and sign images with Binary Authorization to protect against compromised Gazebo assets.
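Enforcing the restricted profile is a namespace label; a minimal sketch for a hypothetical simulation namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: robotics-sim              # hypothetical namespace for the simulation workloads
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest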
Add this architecture to your ML platform, and you can continuously train robotics policies while Autopilot handles the infrastructure curveballs.