VERL Fine-tuned Qwen2.5-0.5B on GSM8K (Step 360)

This model is a VERL (Volcano Engine Reinforcement Learning for LLMs) fine-tuned version of Qwen2.5-0.5B-Instruct on the GSM8K mathematical reasoning dataset using PPO.

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Training Method: VERL PPO
  • Dataset: GSM8K (Grade School Math 8K)
  • Training Steps: 360 steps
  • Framework: VERL

Training Progress

  • Initial Accuracy: ~1-2%
  • Accuracy at Step 20: 26.0%
  • Final Checkpoint: step 360 (this release)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: this repository contains the tokenizer and config only.
# The model weights are in VERL/FSDP format and need conversion first.
tokenizer = AutoTokenizer.from_pretrained("karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360")

# For the full model, convert the VERL checkpoint and load it on top of the base model:
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# model.load_state_dict(converted_state_dict)  # state dict from the converted checkpoint
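
Because the checkpoint is sharded by FSDP, the weights must first be consolidated into a single state dict. Below is a minimal sketch of the loading step, assuming the shards have already been merged into one safetensors file (the path is a placeholder; recent VERL releases include a model-merger utility for the merging itself, so check the VERL docs for your version):

from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

# Start from the base model, then overwrite its weights with the
# consolidated VERL checkpoint (hypothetical path).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
state_dict = load_file("verl_ckpt/consolidated.safetensors")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")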

Training Details

This model was trained using the VERL framework with:

  • Algorithm: PPO (Proximal Policy Optimization)
  • Reward: GSM8K correctness scoring (see the sketch after this list)
  • Batch Size: 64-256 (adjusted adaptively during training)
  • Learning Rates: 1e-6 (actor), 1e-5 (critic)
  • Training Duration: 360 PPO steps
  • GPU Memory Utilization: 0.2-0.4
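
The correctness reward is binary: a response scores 1.0 when its final numeric answer matches the GSM8K reference (which ends in "#### <number>"), and 0.0 otherwise. A minimal sketch of such a scorer, as an illustration only; VERL's actual GSM8K scoring function may normalize answers differently:

import re

def gsm8k_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the final numbers match, else 0.0."""
    def final_number(text: str) -> str | None:
        # Take the last number-like token, stripping "$" and thousands commas.
        nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
        return nums[-1].replace(",", "") if nums else None

    ref = final_number(reference.split("####")[-1])
    pred = final_number(response)
    return 1.0 if pred is not None and pred == ref else 0.0

# Example: the reward is 1.0 because both final answers are 42.
print(gsm8k_reward("... so she earns 42.", "... #### 42"))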

Performance

The model shows substantial improvement in mathematical reasoning over the base model:

  • Baseline: ~1-2% accuracy
  • After 20 steps: 26% accuracy
  • Training continued to 360 steps (this checkpoint)
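
Accuracy figures like these can be reproduced by greedy-decoding the GSM8K test split and comparing final answers, as in the hedged sketch below (the model ID, 50-example subset, and decoding parameters are assumptions; for the fine-tuned numbers, point model_id at a converted checkpoint):

import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: swap in the converted model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def final_number(text: str) -> str | None:
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
    return nums[-1].replace(",", "") if nums else None

test = load_dataset("openai/gsm8k", "main", split="test").select(range(50))
correct = 0
for ex in test:
    messages = [{"role": "user", "content": ex["question"]}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    response = tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    correct += final_number(response) == final_number(ex["answer"].split("####")[-1])
print(f"accuracy: {correct / len(test):.1%}")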

Files Included

  • config.json - Model configuration
  • tokenizer.json - Tokenizer
  • tokenizer_config.json - Tokenizer configuration
  • vocab.json - Vocabulary
  • merges.txt - BPE merges
  • special_tokens_map.json - Special tokens
  • chat_template.jinja - Chat template
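
The bundled chat template wraps conversations in Qwen's ChatML-style markup. A quick way to inspect the prompt the model actually sees:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360")
messages = [{"role": "user", "content": "What is 12 * 7?"}]
# tokenize=False returns the formatted prompt string rather than token IDs.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # ChatML markup with <|im_start|>/<|im_end|> tokens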

Model Weights

The model weights themselves are stored in VERL/FSDP checkpoint format and must be converted before they can be loaded directly with transformers (see the Usage section above). This repository provides the tokenizer and configuration for reference.

Citation

@misc{verl-qwen-gsm8k,
  title={VERL Fine-tuned Qwen2.5-0.5B on GSM8K},
  author={karthik},
  year={2024},
  howpublished={\url{https://huggingface.co/karthik/verl-qwen2.5-0.5b-gsm8k-ppo-step360}},
}

Framework

Trained using VERL, the Volcano Engine Reinforcement Learning for LLMs framework (https://github.com/volcengine/verl).
