Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
Abstract
MINTO, a novel update rule that bootstraps from the minimum estimate between the target and online networks, enables faster and more stable value-function learning in deep reinforcement learning.
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
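To make the update rule concrete, here is a minimal sketch of a MINTO-style bootstrap target in a DQN-like discrete-action setting. Only the min(target, online) bootstrap comes from the abstract; the greedy action selection by the online network and all function and variable names are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch of a MINTO-style bootstrap target (DQN-like setting).
# Assumption: the greedy next action is chosen by the online network; the
# abstract only specifies bootstrapping from min(Q_target, Q_online).
import torch


def minto_target(q_online, q_target, reward, next_obs, done, gamma=0.99):
    """Compute r + gamma * min(Q_target, Q_online) at the greedy next action."""
    with torch.no_grad():
        next_q_online = q_online(next_obs)   # shape: (batch, num_actions)
        next_q_target = q_target(next_obs)   # shape: (batch, num_actions)

        # Assumed action-selection rule: greedy w.r.t. the online network.
        next_action = next_q_online.argmax(dim=1, keepdim=True)

        # MINTO bootstrap: take the minimum of the two estimates to curb the
        # overestimation that pure online-network bootstrapping can cause.
        boot = torch.minimum(
            next_q_target.gather(1, next_action),
            next_q_online.gather(1, next_action),
        ).squeeze(1)

        return reward + gamma * (1.0 - done) * boot
```

The resulting target would replace the standard target-network bootstrap in the temporal-difference loss, leaving the rest of the training loop unchanged.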
Community
MINTO is a simple yet effective target bootstrapping method for off-policy RL that enables faster and more stable learning and consistently improves performance across algorithms and benchmarks.
The following papers were recommended by the Semantic Scholar API
- DEAS: DEtached value learning with Action Sequence for Scalable Offline RL (2025)
- Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption (2025)
- An Investigation of Batch Normalization in Off-Policy Actor-Critic Algorithms (2025)
- Unleashing Flow Policies with Distributional Critics (2025)
- XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning (2025)
- Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning (2025)
- Gradient Free Deep Reinforcement Learning With TabPFN (2025)