|
---
license: apache-2.0
datasets:
- virtuoussy/Math-RLVR
- virtuoussy/Multi-subject-RLVR
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
|
|
|
## Model Details

This is the generative reward model used in the paper "Expanding RL with Verifiable Rewards Across Diverse Domains".

Given a question, a reference answer (label), and a response to be evaluated, the model judges whether the response is correct.
|
|
|
## Quick start
|
|
|
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")

# Judging prompt: the model is asked to output a strict YES/NO verdict.
PROMPT = '''
Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
**The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**

Your task:
- Compare the final output of the solution process with the reference answer.
- If they **match exactly**, output **YES**.
- If they **do not match**, output **NO**.
- If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.

Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.

---

**Question:**
{question}

**Solution Process (Final Step Only):**
{response}

**Reference Answer:**
{reference}

**Output:**
'''

question = "The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is ( )."
label = "Chen Heqin"
answer = "heqin chen"

prompt_question = PROMPT.format(question=question, reference=label, response=answer)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_question},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, do_sample=False, max_new_tokens=10)
# Decode only the newly generated tokens (the YES/NO judgement).
judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print("Model judgement:", judgement)
```
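
The judgement above is a hard YES/NO. If you want a soft score instead (in the spirit of the `prob` metric mentioned in the next section), one option is to renormalize the model's next-token probabilities of "YES" and "NO" right after the prompt. The sketch below is only an illustration under that assumption: it reuses `PROMPT`, `tokenizer`, and `model` from the snippet above, and the helper name `yes_probability` is ours; it is not necessarily how `launch_reward.sh` computes the `prob` reward.

```python
import torch

def yes_probability(question, response, reference):
    """Soft reward in [0, 1]: P("YES") renormalized against P("NO") at the first generated position."""
    prompt = PROMPT.format(question=question, reference=reference, response=response)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # next-token logits after the prompt
    # First token ids of "YES" and "NO" under the Qwen tokenizer.
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

print("Soft reward:", yes_probability(question, answer, label))
```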
|
|
|
## Use as a remote reward |
|
|
|
```bash
# Launch a remote reward server
bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}

# MODEL_PATH: path to our generative reward model.
# ANSWER_PATH: path to the training data.
# METRIC: greedy or prob
# This launches a reward server at http://127.0.0.1:8000/get_reward

# Train against the remote reward
bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}

# Both train.sh and launch_reward.sh can be found in the model directory.
# We will release our GitHub repo soon!
```
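
Once the server is up, your trainer can query it over HTTP. The snippet below is a minimal client sketch with an assumed schema (a JSON body carrying a list of prompt-plus-response texts under `query`, and a `rewards` list in the reply); the actual request and response format is defined by `launch_reward.sh`, so check it before relying on these field names.

```python
import requests

# Minimal client sketch for the remote reward endpoint.
# NOTE: the "query"/"rewards" field names are assumptions, not a documented schema;
# consult launch_reward.sh for the actual payload expected by the server.
REWARD_API = "http://127.0.0.1:8000/get_reward"

payload = {"query": ["<question + model response, formatted as during training>"]}
resp = requests.post(REWARD_API, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"rewards": [1.0]} under the assumed schema
```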
|
|
|
|
|
## Citation |
|
|
|
```bibtex
@article{su2025expanding,
  title={Expanding RL with Verifiable Rewards Across Diverse Domains},
  author={Su, Yi and Yu, Dian and Song, Linfeng and Li, Juntao and Mi, Haitao and Tu, Zhaopeng and Zhang, Min and Yu, Dong},
  journal={arXiv preprint arXiv:2503.23829},
  year={2025}
}
```