RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

This repository contains qwen2.5-math-rlep, a key checkpoint from the RLEP training process built on Qwen2.5-Math-7B, as presented in the paper RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning.

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. RLEP -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
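
To make the two-phase idea concrete, below is a minimal Python sketch of how verified trajectories might be collected and then blended into training mini-batches. It is illustrative only: the `policy.generate`, `verifier`, and `replay_fraction` names are assumptions, not the paper's actual implementation or the official code.

import random

# Hypothetical sketch of RLEP-style experience replay; names and the
# replay fraction are illustrative assumptions, not the official code.

def collect_verified_trajectories(policy, prompts, verifier):
    """Phase 1: keep only rollouts whose final answers the verifier accepts."""
    buffer = []
    for prompt in prompts:
        rollout = policy.generate(prompt)   # hypothetical generate() API
        if verifier(prompt, rollout):       # e.g., exact-match answer check
            buffer.append((prompt, rollout))
    return buffer

def build_mixed_batch(policy, prompts, replay_buffer, batch_size, replay_fraction=0.25):
    """Phase 2: blend fresh rollouts with replayed verified successes."""
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    replayed = random.sample(replay_buffer, n_replay)
    fresh_prompts = random.sample(prompts, batch_size - n_replay)
    fresh = [(p, policy.generate(p)) for p in fresh_prompts]
    return fresh + replayed  # the policy update runs on this blended mini-batch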

[Paper] [Code] [Checkpoints] [Dataset]

[Figure: RLEP method overview]

✨ Key Highlights

  • Rapid early gains: On AIME-2024, RLEP reaches the baseline's peak accuracy by step 135, whereas the baseline needs 380 steps. On AIME-2025 it surpasses the baseline's best score after only 50 steps.
  • Higher final performance: RLEP lifts peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on the AMC-2023 benchmark.

[Figure: RLEP experimental accuracy]

πŸš€ Quick Start (Inference)

Here's a simple example of running inference with vLLM. First, install vLLM (version ≥ 0.7.3):

pip3 install "vllm>=0.7.3"

After installation, you can load and run the model in your Python code like this:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = 'Kwai-Klear/qwen2.5-math-rlep'
sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024 * 3, n=1)
llm = LLM(
    model=model_path,
    enforce_eager=False,
    tensor_parallel_size=1,
    seed=0,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
question = '''Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$'''

prefix = "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\n"
suffix = '\n\nRemember to put your answer on its own line after "Answer:".'
question_with_instruct = prefix + question + suffix  # the model was trained with this instruction format
messages = [{'role': 'user', 'content': question_with_instruct}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate([text], sampling_params)[0]
answer = output.outputs[0].text

print(question)
print(answer)
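
Because the prompt instructs the model to put its final answer on a line beginning with "Answer:", you can pull it out with a small helper like the one below (a convenience sketch, not part of the official repo):

def extract_answer(response):
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(response.splitlines()):
        if line.strip().startswith('Answer:'):
            return line.strip()[len('Answer:'):].strip()
    return None

print(extract_answer(answer))  # for a response ending in 'Answer: 70' this prints '70'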

To evaluate the model on benchmarks such as AIME-2024, AIME-2025, and AMC-2023, please refer to our repo.
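
For a rough sanity check outside the official scripts, a minimal single-sample accuracy loop might look like this. It is illustrative only, reusing `llm`, `tokenizer`, `sampling_params`, `prefix`, `suffix`, and `extract_answer` from the snippets above; the official repo handles prompting, sampling, and scoring properly.

def quick_eval(problems, gold_answers):
    """Rough single-sample accuracy over (question, gold answer) pairs."""
    correct = 0
    for question, gold in zip(problems, gold_answers):
        messages = [{'role': 'user', 'content': prefix + question + suffix}]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        output = llm.generate([text], sampling_params)[0]
        pred = extract_answer(output.outputs[0].text)
        correct += int(pred is not None and pred == str(gold).strip())
    return correct / len(problems)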

Evaluation Results

We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.

Model              AIME-2024   AIME-2025   AMC-2023
DAPO                    32.6        18.9       77.5
DAPO-nodyn-bs64         37.4        19.4       77.3
RLEP                    38.5        21.3       83.0

Citation

If you find our paper or code helpful, we would appreciate it if you could cite our work:

@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning}, 
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451}, 
}

Acknowledgement

We conducted our experiments with the VERL framework and the Qwen2.5-Math-7B model, using the dataset and training scripts provided by DAPO. Many thanks to these open-source works and the broader community for making such resources available!
