---
license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
tags:
- math
- science
- academic
- reasoning
- verification
- weaver
- cross-encoder
- multi-domain
language:
- en
---

# Weaver Distilled for All Datasets (gte-Qwen2-1.5B-instruct)

A general-purpose distilled cross-encoder model based on gte-Qwen2-1.5B-instruct, trained to predict the correctness of reasoning responses across three domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro). The verifier was trained on Weaver scores aggregated from 35 LM judges and reward models.

## Model Details

- **Base Model**: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (1.5B parameters)
- **Architecture**: Cross-encoder with an MLP scoring head (1536 → 768 → 384 → 1); see the sketch after this list
- **Max Sequence Length**: 4096 tokens
- **Training Data**: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
- **Task**: Binary classification for answer-correctness prediction across domains
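
The MLP head maps the pooled cross-encoder embedding down to a single correctness logit. Below is a minimal sketch of that head: the layer dimensions come from the list above, but the module name, activation function, and pooling are illustrative assumptions, not the exact checkpoint layout.

```python
import torch.nn as nn

# Hypothetical sketch of the scoring head: three linear layers mapping the
# 1536-dim pooled embedding to a single logit. The activation choice is an
# assumption, not the exact checkpoint layout.
class ScoringHeadSketch(nn.Module):
    def __init__(self, hidden_size: int = 1536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 768),
            nn.ReLU(),
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Linear(384, 1),  # single correctness logit
        )

    def forward(self, pooled_embedding):
        return self.mlp(pooled_embedding)
```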

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage - works across math, science, and academic domains
instruction = "What is the derivative of f(x) = 3x² + 2x - 1?"
response = "Using the power rule: f'(x) = 6x + 2. The derivative of 3x² is 6x, the derivative of 2x is 2, and the derivative of -1 is 0."

# Tokenize the (question, response) pair as a single cross-encoder input
inputs = tokenizer(
    instruction,
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt",
)

# Get correctness score
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
```
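
Because Weaver-style verification is typically used to rerank multiple candidate generations, you can also score several responses to the same question in one batch and keep the highest-scoring one. The snippet below is an illustrative continuation of the example above (the candidate strings are made up):

```python
# Score several candidate responses to one question in a single batch,
# reusing `tokenizer`, `model`, and `instruction` from the example above.
candidates = [
    "By the power rule, f'(x) = 6x + 2.",
    "f'(x) = 3x + 2.",   # incorrect
    "f'(x) = 6x^2 + 2.", # incorrect
]

batch = tokenizer(
    [instruction] * len(candidates),
    candidates,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    scores = torch.sigmoid(model(**batch).logits).squeeze(-1)

best = scores.argmax().item()
print(f"Best candidate (score {scores[best].item():.3f}): {candidates[best]}")
```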

## Training Details

This model was trained using the [Weaver distillation pipeline](https://github.com/HazyResearch/scaling-verification) on a combined dataset spanning multiple reasoning domains. To train your own distilled models, see the [distillation README](https://github.com/HazyResearch/scaling-verification/blob/main/distillation/README.md).
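
At a high level, distillation fine-tunes the cross-encoder so that its logit matches Weaver's aggregated verifier score for each (question, response) pair. The following is a minimal sketch of one such training step, assuming soft Weaver scores in [0, 1] serve as targets for a binary cross-entropy loss; the data, loss choice, and loop structure are assumptions for illustration, not the pipeline's exact configuration:

```python
import torch
import torch.nn.functional as F

# One illustrative training step (hypothetical data, reusing `tokenizer`
# and `model` from the Quick Start). The logit for each (question, response)
# pair is pushed toward its aggregated Weaver score.
questions = ["What is 2 + 2?"]
responses = ["2 + 2 = 4."]
weaver_scores = torch.tensor([0.97])  # soft label in [0, 1]

batch = tokenizer(
    questions,
    responses,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt",
)

logits = model(**batch).logits.squeeze(-1)
loss = F.binary_cross_entropy_with_logits(logits, weaver_scores)
loss.backward()  # an optimizer step would follow in a real training loop
```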

## Citation

```bibtex
@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
      title={Shrinking the Generation-Verification Gap with Weak Verifiers},
      author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
      year={2025},
      eprint={2506.18203},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.18203},
}
```