VERL Fine-tuned Qwen2.5-0.5B on GSM8K (Step 360)
This model is a VERL (Volcano Engine Reinforcement Learning for LLMs) fine-tuned version of Qwen2.5-0.5B-Instruct on the GSM8K mathematical reasoning dataset using PPO.
Model Details
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Training Method: VERL PPO
- Dataset: GSM8K (Grade School Math 8K)
- Training Steps: 360 steps
- Framework: VERL
Training Progress
- Initial Accuracy: ~1-2%
- Step 20 Accuracy: 26.0%
- Final Step 360: Latest checkpoint
Usage
from transformers import AutoTokenizer
# Note: This repository contains the tokenizer and config
# Model weights are in VERL/FSDP format and need conversion
tokenizer = AutoTokenizer.from_pretrained("karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360")
# For the full model, you would need to convert the VERL checkpoint:
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# # Then load the VERL checkpoint weights
Training Details
This model was trained using the VERL framework with:
- Algorithm: PPO (Proximal Policy Optimization)
- Reward: GSM8K correctness scoring
- Batch Size: 64-256 (adaptive)
- Learning Rate: 1e-6 (actor), 1e-5 (critic)
- Training Duration: 360 PPO steps
- GPU Memory: 0.2-0.4 utilization
Performance
Shows significant improvement in mathematical reasoning:
- Baseline: ~1-2% accuracy
- After 20 steps: 26% accuracy
- Continued training to 360 steps
Files Included
config.json
- Model configurationtokenizer.json
- Tokenizertokenizer_config.json
- Tokenizer configurationvocab.json
- Vocabularymerges.txt
- BPE mergesspecial_tokens_map.json
- Special tokenschat_template.jinja
- Chat template
Model Weights
The actual model weights are stored in VERL/FSDP format and would need conversion for direct use with transformers. This repository provides the tokenizer and configuration for reference.
Citation
@misc{verl-qwen-gsm8k,
title={VERL Fine-tuned Qwen2.5-0.5B on GSM8K},
author={karthik},
year={2024},
howpublished={\url{https://huggingface.co/karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360}},
}
Framework
Trained using VERL - Versatile Reinforcement Learning framework.
- Downloads last month
- 17