A Balanced Data Diet
Addressing the bottleneck in mega-scale RL for robot control.
One policy, every terrain.
A single network controls the robot across all terrains.
No reward engineering.
A sparse success signal and generic regularizers. No demonstrations, no distillation.
Only the goal pose.
On each terrain, the policy is given a single target pose at the end of the course.
A Markovian MLP policy.
Four to eight layers, no transformer, no LSTM. Terrain comes in as a heightmap.
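As a rough illustration of what a Markovian MLP policy with a heightmap input could look like, here is a minimal sketch in plain Python. Everything specific in it is an assumption made for the example, not a detail from this page: the observation layout (proprioception, flattened heightmap, goal pose), all dimensions, the ReLU activation, and the helper names `init_mlp`, `mlp_forward`, and `build_observation`.

```python
import math
import random

def init_mlp(sizes, seed=0):
    """Create one (weights, bias) pair per layer for the given layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def mlp_forward(layers, x):
    """Feed-forward pass; the output depends only on the current observation."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only (assumed)
    return x

def build_observation(proprio, heightmap, goal_pose):
    """Flatten the heightmap grid and concatenate it with the other inputs."""
    flat_map = [h for row in heightmap for h in row]
    return list(proprio) + flat_map + list(goal_pose)

# Hypothetical dimensions, chosen only to keep the example small.
PROPRIO_DIM, MAP_ROWS, MAP_COLS, GOAL_DIM, ACTION_DIM = 8, 4, 4, 3, 6
obs_dim = PROPRIO_DIM + MAP_ROWS * MAP_COLS + GOAL_DIM

policy = init_mlp([obs_dim, 32, 32, 32, ACTION_DIM])  # four layers
obs = build_observation(
    [0.0] * PROPRIO_DIM,
    [[0.1] * MAP_COLS for _ in range(MAP_ROWS)],
    [1.0, 0.0, 0.0],
)
action = mlp_forward(policy, obs)
```

Because there is no recurrent state or attention over past steps, calling `mlp_forward` twice on the same observation returns the same action.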
Ten terrains, back to back.
One continuous run, no resets between them.
The same recipe does manipulation.
Success-Guided Sampling trains contact-rich assembly the same way.
Contact-rich assembly.
A NIST taskboard task, trained with reinforcement learning, no demonstrations.
How SGS works.
Task configurations are sampled according to the policy's current success rate on each one, concentrating on those it solves about half the time.
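One simple way to realize "concentrate on configurations solved about half the time" is to weight each configuration by p(1 - p), which peaks at a success rate of p = 0.5. The sketch below illustrates that idea only; the exact weighting SGS uses is not stated here, and the `floor` exploration term, the moving-average update, and all function names are assumptions made for the example.

```python
import random

def sgs_weights(success_rates, floor=0.05):
    """Normalized sampling weights that peak where success rate is 0.5.

    `floor` is a hypothetical exploration term so that fully solved or
    never-solved configurations are still visited occasionally.
    """
    raw = [p * (1.0 - p) + floor for p in success_rates]
    total = sum(raw)
    return [w / total for w in raw]

def sample_configs(success_rates, k, rng=random):
    """Draw k configuration indices in proportion to the weights above."""
    weights = sgs_weights(success_rates)
    return rng.choices(range(len(success_rates)), weights=weights, k=k)

def update_success_rate(old_rate, succeeded, momentum=0.9):
    """Exponential moving average of a configuration's success rate (assumed)."""
    outcome = 1.0 if succeeded else 0.0
    return momentum * old_rate + (1.0 - momentum) * outcome

# Four hypothetical configurations: unsolved, half-solved, solved, nearly half.
rates = [0.0, 0.5, 1.0, 0.45]
weights = sgs_weights(rates)
batch = sample_configs(rates, k=1000, rng=random.Random(0))
```

With these rates, the half-solved configuration receives the largest weight, while the unsolved and fully solved ones share the smallest.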
Past one million environments.
Success rate continues to rise as the number of parallel environments grows past one million, 16x the scale of prior work.
On real hardware.
Policies transferred from simulation to physical robots.