arXiv Paper: https://arxiv.org/abs/2509.22601

SPEAR is a curriculum-based self-imitation learning (SIL) framework for training agentic LLMs on long-horizon, sparse-reward tasks. It balances exploration and exploitation by first leveraging auxiliary tool-use rewards to encourage broad skill-level exploration, and later strengthening self-imitation to exploit successful trajectories from replayed experiences. This adaptive curriculum stabilizes training and improves efficiency while maintaining well-controlled entropy.
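How the curriculum shifts from exploration to exploitation can be pictured with a small scheduling function. The sketch below is illustrative only: the warmup fraction, the linear ramps, and the coefficient names are assumptions, not the exact schedule used in the paper.

```python
def curriculum_weights(step: int, total_steps: int, warmup_frac: float = 0.3):
    """Illustrative curriculum: emphasize auxiliary tool-use (intrinsic) rewards
    early to drive exploration, then ramp up the self-imitation weight so that
    training increasingly exploits replayed successful trajectories.
    The 0.3 warmup fraction and linear ramps are assumptions for illustration.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    if progress < warmup_frac:
        intrinsic_w = 1.0                        # strong intrinsic reward shaping
        sil_w = 0.1 * progress / warmup_frac     # keep self-imitation weak early on
    else:
        ramp = (progress - warmup_frac) / (1.0 - warmup_frac)
        intrinsic_w = 1.0 - 0.9 * ramp           # decay intrinsic shaping
        sil_w = 0.1 + 0.9 * ramp                 # strengthen self-imitation
    return intrinsic_w, sil_w
```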

Overview

SPEAR trains long-horizon LLM agents via group-based RL. Compared with vanilla GRPO-like algorithms, it introduces curriculum-based self-imitation learning with intrinsic reward shaping. Given the same data input, a group of trajectories is generated through multi-turn tool interactions and then used for episode-level reward computation and advantage estimation. Valuable trajectories are then filtered into a replay buffer, where the stored past experiences guide the agent to explore effectively on sparsely rewarded tasks via self-imitation. The total training batch contains both on-policy data and off-policy data drawn from the replay buffer.

The core concept of our proposed SPEAR.
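As a rough illustration of the rollout-to-buffer path described above, the sketch below scores a group of trajectories, estimates group-relative advantages, and keeps only high-reward trajectories in a replay buffer. The `ReplayBuffer` class, the group-mean filtering threshold, and the field names are illustrative assumptions, not SPEAR's exact implementation.

```python
import numpy as np

class ReplayBuffer:
    """Toy FIFO buffer of successful trajectories (illustrative only)."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.items = []

    def add(self, trajectory):
        self.items.append(trajectory)
        self.items = self.items[-self.capacity:]

    def sample(self, k):
        if not self.items:
            return []
        idx = np.random.choice(len(self.items), size=min(k, len(self.items)), replace=False)
        return [self.items[i] for i in idx]


def process_group(trajectories, rewards, buffer):
    """Group-relative advantage estimation (GRPO-style) plus buffer filtering."""
    rewards = np.asarray(rewards, dtype=np.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Keep only trajectories that beat the group mean; the actual filtering
    # criterion in SPEAR may differ -- this threshold is an assumption.
    for traj, r, adv in zip(trajectories, rewards, advantages):
        if r > rewards.mean():
            buffer.add({"trajectory": traj, "reward": float(r), "advantage": float(adv)})
    return advantages
```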

Overview of SPEAR in terms of data flow. During each episode, the agent interacts with the environment to generate a set of trajectories. These trajectories are processed along two complementary paths. First, they are used for intrinsic reward shaping, advantage estimation, and on-policy updates, following a mechanism similar to vanilla GRPO. Second, they are selectively filtered and stored in a replay buffer, enabling off-policy updates through the proposed self-imitation scheme with advantage recalibration and regularization. This dual integration allows the agent to maximize the utility of rewarding past experiences, effectively expanding the exploration space while mitigating persistent over-uncertainty in decision-making under shifting distributions of external feedback. As a result, SPEAR achieves a stable balance between exploration and exploitation through self-guided policy adaptation.

Overview of SPEAR in terms of data flow.
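To make the dual on-/off-policy update concrete, here is a hedged sketch of how the two loss terms might be combined in a single optimization step. The clipping constant, the non-negative re-clipping of stored advantages, and the `sil_weight` coefficient are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def mixed_policy_loss(logp_new, logp_old, adv_onpolicy,
                      logp_replay_new, logp_replay_old, adv_replay,
                      sil_weight=0.2, clip_eps=0.2):
    """Combine an on-policy clipped surrogate with a self-imitation term on
    replayed trajectories. Stored advantages are re-clipped to be non-negative
    so the agent only imitates rewarding past behavior (the exact advantage
    recalibration and regularization in SPEAR may differ)."""
    # On-policy GRPO/PPO-style clipped objective.
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.minimum(ratio * adv_onpolicy,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_onpolicy)
    on_policy_loss = -surr.mean()

    # Off-policy self-imitation: imitate only positive-advantage replayed actions.
    replay_ratio = torch.exp(logp_replay_new - logp_replay_old)
    sil_adv = torch.clamp(adv_replay, min=0.0)   # assumed advantage recalibration
    sil_loss = -(torch.clamp(replay_ratio, 1 - clip_eps, 1 + clip_eps) * sil_adv).mean()

    return on_policy_loss + sil_weight * sil_loss
```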

Contents

This Hugging Face model repo contains the checkpoint of SPEAR-ALFWorld-DrBoT-GiGPO-1.5B.
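A minimal usage sketch for loading this checkpoint with the transformers library (standard causal-LM loading is assumed; full agentic evaluation on ALFWorld additionally requires the verl-agent environment tooling, and the example task prompt below is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yolay/SPEAR-ALFWorld-DrBoT-GiGPO-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored in BF16
    device_map="auto",
)

# Illustrative single-turn prompt; real ALFWorld rollouts are multi-turn with tool feedback.
messages = [{"role": "user", "content": "You are in a household. Your task is to put a clean mug on the coffee table."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```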

Results

Results using Qwen2.5-1.5B-Instruct on ALFWorld and WebShop:

| Method | ALFWorld | WebShop (SR) |
|---|---|---|
| GRPO | 72.8 | 56.8 |
| +SPEAR (ours) | 88.9 (+16.1) | 77.5 (+20.7) |
| Dr.BoT (GRPO) | 79.1 | 62.9 |
| +SPEAR (ours) | 87.7 (+8.6) | 76.8 (+13.9) |
| GiGPO | 86.1 | 67.4 |
| +SPEAR (ours) | 91.2 (+5.1) | 79.3 (+11.8) |
| Dr.BoT (GiGPO) | 90.6 | 68.8 |
| +SPEAR (ours) | 93.2 (+2.6) | 81.1 (+12.3) |

Results using Qwen2.5-32B-Instruct and Qwen3-32B-Instruct on AIME24 and AIME25:

| Method | Model | AIME24 | AIME25 |
|---|---|---|---|
| PPO | Qwen2.5-32B-Instruct | - | 55.0 |
| GRPO | Qwen2.5-32B-Instruct | - | 60.0 |
| Dr.BoT (GRPO) | Qwen2.5-32B-Instruct | 64.7 | 54.0 |
| +SPEAR (ours) | Qwen2.5-32B-Instruct | 66.3 (+1.6) | 60.1 (+6.1) |
| Dr.BoT (GRPO) | Qwen3-32B-Instruct | 82.5 | 77.3 |
| +SPEAR (ours) | Qwen3-32B-Instruct | 85.6 (+3.1) | 80.5 (+3.2) |

Citation

If you find this work useful, please consider citing it as follows:

@misc{qin2025learnropestrustwins,
      title={Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning}, 
      author={Yulei Qin and Xiaoyu Tan and Zhengbao He and Gang Li and Haojia Lin and Zongyi Li and Zihan Xu and Yuchen Shi and Siqi Cai and Renting Rui and Shaofei Cai and Yuzheng Cai and Xuan Zhang and Sheng Ye and Ke Li and Xing Sun},
      year={2025},
      eprint={2509.22601},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.22601}, 
}

Acknowledgement

Our codebase is built upon verl and verl-agent. We greatly appreciate their excellent work and the dedication of the contributors who made these projects available to the community.
