arXiv Paper: https://arxiv.org/abs/2509.22601

SPEAR is a curriculum-based self-imitation learning (SIL) framework for training agentic LLMs on long-horizon, sparse-reward tasks. It balances exploration and exploitation by first leveraging auxiliary tool-use rewards to encourage broad skill-level exploration, and later strengthening self-imitation to exploit successful trajectories from replayed experiences. This adaptive curriculum stabilizes training and improves efficiency while maintaining well-controlled entropy.
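How the curriculum shifts from exploration to exploitation can be pictured with a small scheduling function. The sketch below is illustrative only: the warmup fraction, the linear ramps, and the coefficient names are assumptions, not the exact schedule used in the paper.

```python
def curriculum_weights(step: int, total_steps: int, warmup_frac: float = 0.3):
    """Illustrative curriculum: emphasize auxiliary tool-use (intrinsic) rewards
    early to drive exploration, then ramp up the self-imitation weight so that
    training increasingly exploits replayed successful trajectories.
    The 0.3 warmup fraction and linear ramps are assumptions for illustration.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    if progress < warmup_frac:
        intrinsic_w = 1.0                        # strong intrinsic reward shaping
        sil_w = 0.1 * progress / warmup_frac     # keep self-imitation weak early on
    else:
        ramp = (progress - warmup_frac) / (1.0 - warmup_frac)
        intrinsic_w = 1.0 - 0.9 * ramp           # decay intrinsic shaping
        sil_w = 0.1 + 0.9 * ramp                 # strengthen self-imitation
    return intrinsic_w, sil_w
```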

Overview

SPEAR trains long-horizon LLM agents via group-based RL. Compared with vanilla GRPO-like algorithms, it introduces curriculum-based self-imitation learning with intrinsic reward shaping. Given the same data input, a group of trajectories is generated through multi-turn tool interactions and then used for episode-level reward computation and advantage estimation. Valuable trajectories are then filtered into a replay buffer, where the stored past experiences guide the agent to explore effectively on sparsely rewarded tasks via self-imitation. The total training batch contains both on-policy data and off-policy data drawn from the replay buffer.

The core concept of our proposed SPEAR.
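As a rough illustration of the rollout-to-buffer path described above, the sketch below scores a group of trajectories, estimates group-relative advantages, and keeps only high-reward trajectories in a replay buffer. The `ReplayBuffer` class, the group-mean filtering threshold, and the field names are illustrative assumptions, not SPEAR's exact implementation.

```python
import numpy as np

class ReplayBuffer:
    """Toy FIFO buffer of successful trajectories (illustrative only)."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.items = []

    def add(self, trajectory):
        self.items.append(trajectory)
        self.items = self.items[-self.capacity:]

    def sample(self, k):
        if not self.items:
            return []
        idx = np.random.choice(len(self.items), size=min(k, len(self.items)), replace=False)
        return [self.items[i] for i in idx]


def process_group(trajectories, rewards, buffer):
    """Group-relative advantage estimation (GRPO-style) plus buffer filtering."""
    rewards = np.asarray(rewards, dtype=np.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Keep only trajectories that beat the group mean; the actual filtering
    # criterion in SPEAR may differ -- this threshold is an assumption.
    for traj, r, adv in zip(trajectories, rewards, advantages):
        if r > rewards.mean():
            buffer.add({"trajectory": traj, "reward": float(r), "advantage": float(adv)})
    return advantages
```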

Overview of SPEAR in terms of data flow. During each episode, the agent interacts with the environment to generate a set of trajectories. These trajectories are processed along two complementary paths. First, they are used for intrinsic reward shaping, advantage estimation, and on-policy updates, following a mechanism similar to vanilla GRPO. Second, they are selectively filtered and stored in a replay buffer, enabling off-policy updates through the proposed self-imitation scheme with advantage recalibration and regularization. This dual integration allows the agent to maximize the utility of rewarding past experiences, effectively expanding the exploration space while mitigating persistent over-uncertainty in decision-making under shifting distributions of external feedback. As a result, SPEAR achieves a stable balance between exploration and exploitation through self-guided policy adaptation.

Overview of SPEAR in terms of data flow.
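To make the dual on-/off-policy update concrete, here is a hedged sketch of how the two loss terms might be combined in a single optimization step. The clipping constant, the non-negative re-clipping of stored advantages, and the `sil_weight` coefficient are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def mixed_policy_loss(logp_new, logp_old, adv_onpolicy,
                      logp_replay_new, logp_replay_old, adv_replay,
                      sil_weight=0.2, clip_eps=0.2):
    """Combine an on-policy clipped surrogate with a self-imitation term on
    replayed trajectories. Stored advantages are re-clipped to be non-negative
    so the agent only imitates rewarding past behavior (the exact advantage
    recalibration and regularization in SPEAR may differ)."""
    # On-policy GRPO/PPO-style clipped objective.
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.minimum(ratio * adv_onpolicy,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_onpolicy)
    on_policy_loss = -surr.mean()

    # Off-policy self-imitation: imitate only positive-advantage replayed actions.
    replay_ratio = torch.exp(logp_replay_new - logp_replay_old)
    sil_adv = torch.clamp(adv_replay, min=0.0)   # assumed advantage recalibration
    sil_loss = -(torch.clamp(replay_ratio, 1 - clip_eps, 1 + clip_eps) * sil_adv).mean()

    return on_policy_loss + sil_weight * sil_loss
```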

Contents

This Hugging Face model repo contains the checkpoint of SPEAR-ALFWorld-DrBoT-GiGPO-1.5B.
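A minimal usage sketch for loading this checkpoint with the transformers library (standard causal-LM loading is assumed; full agentic evaluation on ALFWorld additionally requires the verl-agent environment tooling, and the example task prompt below is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yolay/SPEAR-ALFWorld-DrBoT-GiGPO-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored in BF16
    device_map="auto",
)

# Illustrative single-turn prompt; real ALFWorld rollouts are multi-turn with tool feedback.
messages = [{"role": "user", "content": "You are in a household. Your task is to put a clean mug on the coffee table."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```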

Results

Results using Qwen2.5-1.5B-Instruct on ALFWorld and WebShop:

| Method | ALFWorld | WebShop (SR) |
|---|---|---|
| GRPO | 72.8 | 56.8 |
| +SPEAR (ours) | 88.9 (+16.1) | 77.5 (+20.7) |
| Dr.BoT (GRPO) | 79.1 | 62.9 |
| +SPEAR (ours) | 87.7 (+8.6) | 76.8 (+13.9) |
| GiGPO | 86.1 | 67.4 |
| +SPEAR (ours) | 91.2 (+5.1) | 79.3 (+11.8) |
| Dr.BoT (GiGPO) | 90.6 | 68.8 |
| +SPEAR (ours) | 93.2 (+2.6) | 81.1 (+12.3) |

Results using Qwen2.5-32B-Instruct and Qwen3-32B-Instruct on AIME24 and AIME25:

| Method | Model | AIME24 | AIME25 |
|---|---|---|---|
| PPO | Qwen2.5-32B-Instruct | - | 55.0 |
| GRPO | Qwen2.5-32B-Instruct | - | 60.0 |
| Dr.BoT (GRPO) | Qwen2.5-32B-Instruct | 64.7 | 54.0 |
| +SPEAR (ours) | Qwen2.5-32B-Instruct | 66.3 (+1.6) | 60.1 (+6.1) |
| Dr.BoT (GRPO) | Qwen3-32B-Instruct | 82.5 | 77.3 |
| +SPEAR (ours) | Qwen3-32B-Instruct | 85.6 (+3.1) | 80.5 (+3.2) |

Citation

If you find this work useful, please consider citing it as follows:

@misc{qin2025learnropestrustwins,
      title={Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning}, 
      author={Yulei Qin and Xiaoyu Tan and Zhengbao He and Gang Li and Haojia Lin and Zongyi Li and Zihan Xu and Yuchen Shi and Siqi Cai and Renting Rui and Shaofei Cai and Yuzheng Cai and Xuan Zhang and Sheng Ye and Ke Li and Xing Sun},
      year={2025},
      eprint={2509.22601},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.22601}, 
}

Acknowledgement

Our codebase is built upon verl and verl-agent. We greatly appreciate their excellent work and the dedication of the contributors who made these projects available to the community.
