A Balanced Data Diet: Mega-Scale RL for Robot Control

Success-Guided Sampling

Addressing the bottleneck in mega-scale RL for robot control.

One policy

A single network controls the robot across all terrains.

Training

A sparse success signal and generic regularizers. No demonstrations, no distillation.

Goal

Each terrain gives the policy one target pose at the end.
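A sparse success signal tied to a target pose could look roughly like the sketch below. It is illustrative only: the pose layout `(x, y, yaw)`, the tolerances `pos_tol` and `yaw_tol`, and both function names are assumptions not given on this page, and the generic regularizers mentioned under Training are left out.

```python
import math

def reached_target(pose, target, pos_tol=0.25, yaw_tol=0.5):
    """Return True if pose is within tolerance of target.

    Poses are hypothetical (x, y, yaw) tuples; the tolerances are
    placeholder values, not numbers from the source.
    """
    dx = pose[0] - target[0]
    dy = pose[1] - target[1]
    # Wrap the heading difference into [-pi, pi] before comparing.
    diff = pose[2] - target[2]
    dyaw = math.atan2(math.sin(diff), math.cos(diff))
    return math.hypot(dx, dy) <= pos_tol and abs(dyaw) <= yaw_tol

def sparse_reward(pose, target, episode_done):
    # Zero everywhere except on a final step that ends at the target pose.
    if episode_done and reached_target(pose, target):
        return 1.0
    return 0.0
```

With a signal like this the policy gets no shaping along the way, which is why the choice of which task configurations to train on matters so much.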

Architecture

Four to eight layers, no transformer, no LSTM. Terrain comes in as a heightmap.

One run

One continuous run across the terrains, with no resets between them.

Same method

Success-Guided Sampling trains contact-rich assembly the same way.

Manipulation

A NIST taskboard task, trained with reinforcement learning, no demonstrations.

Method

Task configurations are sampled according to the policy's current success rate on each one, concentrating on those it solves about half the time.
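One way to picture this is the minimal sketch below. It is not the authors' implementation: the weighting function `4 * p * (1 - p)`, the `floor` term, the moving-average update, and all function names are assumptions chosen only to show sampling that peaks near a 50% success rate.

```python
import random

def sampling_probabilities(success_rates, floor=0.05):
    """Turn per-configuration success rates into sampling probabilities."""
    # 4*p*(1-p) is 1.0 at p = 0.5 and 0.0 at p = 0 or p = 1.
    # The floor (an assumed value) keeps every configuration reachable,
    # so its success estimate can still change.
    weights = [floor + 4.0 * p * (1.0 - p) for p in success_rates]
    total = sum(weights)
    return [w / total for w in weights]

def update_success_rate(old_rate, succeeded, momentum=0.9):
    # Exponential moving average of the binary episode outcome.
    outcome = 1.0 if succeeded else 0.0
    return momentum * old_rate + (1.0 - momentum) * outcome

def sample_configurations(success_rates, num_envs, rng):
    # Draw one configuration index per parallel environment.
    probs = sampling_probabilities(success_rates)
    indices = range(len(success_rates))
    return rng.choices(indices, weights=probs, k=num_envs)

rng = random.Random(0)
rates = [0.02, 0.5, 0.98]   # nearly unsolved, half solved, nearly solved
probs = sampling_probabilities(rates)
batch = sample_configurations(rates, 8, rng)
```

Here the middle configuration receives the largest share of environments, while the nearly unsolved and nearly solved ones are visited only occasionally.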

Scaling

Success rate continues to rise as the number of parallel environments grows to over one million, 16x more than prior work.

Real-world

Policies transferred from simulation to physical robots.