arxiv:2509.19301

Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Published on Sep 23 · Submitted by Lars Lien Ankile on Sep 26
Abstract

A residual learning framework combines behavior cloning and reinforcement learning to improve manipulation policies on high-degree-of-freedom systems using sparse binary rewards.

AI-generated summary

Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing offline data. In contrast, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we present, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results show state-of-the-art performance on various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: https://residual-offpolicy-rl.github.io
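
The core composition described in the abstract is simple: a frozen BC policy proposes a base action, and a small learned network adds a per-step correction. Below is a minimal PyTorch sketch of that composition; the network sizes, the tanh-bounded output, and the scaling factor are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of residual action composition (illustrative assumptions,
# not the paper's exact implementation).
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, scale: float = 0.1):
        super().__init__()
        # Small MLP mapping (observation, base action) -> bounded correction.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.scale = scale  # keeps the correction small relative to the base action

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        correction = self.net(torch.cat([obs, base_action], dim=-1))
        return base_action + self.scale * correction


def act(bc_policy: nn.Module, residual: ResidualPolicy, obs: torch.Tensor) -> torch.Tensor:
    # The BC policy is treated as a frozen black box: only the residual is trained.
    with torch.no_grad():
        base_action = bc_policy(obs)
    return residual(obs, base_action)
```

Keeping the correction bounded and small relative to the base action is one way to stay close to the BC policy's behavior early in training, which fits the safety and sample-efficiency motivations above.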

Community

We present ResFiT, a method that makes real-world RL possible on a 29-DoF bimanual humanoid. Fine-tuning on the real robot boosts success rates from ~14–23% to ~64% with ~15–76 minutes of robot interaction data. We achieve this by freezing a pre-trained action-chunked BC policy and learning single-step residual corrections on top via off-policy RL from sparse rewards.
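
The note above mentions an action-chunked BC base with single-step residual corrections trained off-policy from sparse rewards. The sketch below shows how such a data-collection loop might look; the `env`, `bc_policy.predict_chunk`, `residual_policy.correction`, and `replay_buffer` interfaces are hypothetical placeholders, and the off-policy learner is only indicated in a comment.

```python
# Sketch of one data-collection episode under the assumptions stated above:
# the frozen BC policy emits an action chunk, the residual adds a per-step
# correction, and transitions with a sparse 0/1 reward are stored for
# off-policy updates. All interfaces here are placeholders, not the paper's code.
def collect_episode(env, bc_policy, residual_policy, replay_buffer, chunk_size=8):
    obs = env.reset()
    done = False
    while not done:
        # Query the frozen BC policy once per chunk of actions.
        action_chunk = bc_policy.predict_chunk(obs, horizon=chunk_size)
        for base_action in action_chunk:
            # Single-step residual correction on top of the chunked base action.
            action = base_action + residual_policy.correction(obs, base_action)
            next_obs, reward, done, info = env.step(action)  # reward is a sparse 0/1 success signal
            replay_buffer.add(obs, base_action, action, reward, next_obs, done)
            obs = next_obs
            if done:
                break
    # The residual policy is then updated from replay_buffer samples with an
    # off-policy algorithm (e.g., a SAC-style actor-critic update).
```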

