VERL Fine-tuned Qwen2.5-0.5B on GSM8K (Step 360)

This model is a VERL (Volcano Engine Reinforcement Learning for LLMs) fine-tuned version of Qwen2.5-0.5B-Instruct on the GSM8K mathematical reasoning dataset using PPO.

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Training Method: VERL PPO
  • Dataset: GSM8K (Grade School Math 8K)
  • Training Steps: 360 steps
  • Framework: VERL

Training Progress

  • Initial Accuracy: ~1-2%
  • Accuracy at Step 20: 26.0%
  • Final Checkpoint: step 360 (this release)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: this repository contains the tokenizer and config only.
# The model weights are in VERL/FSDP format and need conversion first.
tokenizer = AutoTokenizer.from_pretrained("karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360")

# For the full model, convert the VERL checkpoint and load it on top of the base model:
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# model.load_state_dict(converted_state_dict)  # state dict from the converted checkpoint
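
Because the checkpoint is sharded by FSDP, the weights must first be consolidated into a single state dict. Below is a minimal sketch of the loading step, assuming the shards have already been merged into one safetensors file (the path is a placeholder; recent VERL releases include a model-merger utility for the merging itself, so check the VERL docs for your version):

from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

# Start from the base model, then overwrite its weights with the
# consolidated VERL checkpoint (hypothetical path).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
state_dict = load_file("verl_ckpt/consolidated.safetensors")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")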

Training Details

This model was trained using the VERL framework with:

  • Algorithm: PPO (Proximal Policy Optimization)
  • Reward: GSM8K correctness scoring (see the sketch after this list)
  • Batch Size: 64-256 (adjusted adaptively during training)
  • Learning Rates: 1e-6 (actor), 1e-5 (critic)
  • Training Duration: 360 PPO steps
  • GPU Memory Utilization: 0.2-0.4
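
The correctness reward is binary: a response scores 1.0 when its final numeric answer matches the GSM8K reference (which ends in "#### <number>"), and 0.0 otherwise. A minimal sketch of such a scorer, as an illustration only; VERL's actual GSM8K scoring function may normalize answers differently:

import re

def gsm8k_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the final numbers match, else 0.0."""
    def final_number(text: str) -> str | None:
        # Take the last number-like token, stripping "$" and thousands commas.
        nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
        return nums[-1].replace(",", "") if nums else None

    ref = final_number(reference.split("####")[-1])
    pred = final_number(response)
    return 1.0 if pred is not None and pred == ref else 0.0

# Example: the reward is 1.0 because both final answers are 42.
print(gsm8k_reward("... so she earns 42.", "... #### 42"))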

Performance

The model shows substantial improvement in mathematical reasoning over the base model:

  • Baseline: ~1-2% accuracy
  • After 20 steps: 26% accuracy
  • Training continued to 360 steps (this checkpoint)
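
Accuracy figures like these can be reproduced by greedy-decoding the GSM8K test split and comparing final answers, as in the hedged sketch below (the model ID, 50-example subset, and decoding parameters are assumptions; for the fine-tuned numbers, point model_id at a converted checkpoint):

import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: swap in the converted model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def final_number(text: str) -> str | None:
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
    return nums[-1].replace(",", "") if nums else None

test = load_dataset("openai/gsm8k", "main", split="test").select(range(50))
correct = 0
for ex in test:
    messages = [{"role": "user", "content": ex["question"]}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    response = tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    correct += final_number(response) == final_number(ex["answer"].split("####")[-1])
print(f"accuracy: {correct / len(test):.1%}")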

Files Included

  • config.json - Model configuration
  • tokenizer.json - Tokenizer
  • tokenizer_config.json - Tokenizer configuration
  • vocab.json - Vocabulary
  • merges.txt - BPE merges
  • special_tokens_map.json - Special tokens
  • chat_template.jinja - Chat template
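
The bundled chat template wraps conversations in Qwen's ChatML-style markup. A quick way to inspect the prompt the model actually sees:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360")
messages = [{"role": "user", "content": "What is 12 * 7?"}]
# tokenize=False returns the formatted prompt string rather than token IDs.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # ChatML markup with <|im_start|>/<|im_end|> tokens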

Model Weights

The model weights themselves are stored in VERL/FSDP checkpoint format and must be converted before they can be loaded directly with transformers (see the Usage section above). This repository provides the tokenizer and configuration for reference.

Citation

@misc{verl-qwen-gsm8k,
  title={VERL Fine-tuned Qwen2.5-0.5B on GSM8K},
  author={karthik},
  year={2024},
  howpublished={\url{https://huggingface.co/karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360}},
}

Framework

Trained using VERL, the Volcano Engine Reinforcement Learning for LLMs framework (https://github.com/volcengine/verl).
