---
license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
tags:
- math
- science
- academic
- reasoning
- verification
- weaver
- cross-encoder
- multi-domain
language:
- en
---
# Weaver Distilled for All Datasets (gte-Qwen2-1.5B-instruct)
A general-purpose distilled cross-encoder based on gte-Qwen2-1.5B-instruct, trained to predict the correctness of reasoning responses across multiple domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro). The verifier was trained on Weaver scores aggregated over 35 different verifiers and reward models.
## Model Details
- **Base Model**: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (1.5B parameters)
- **Architecture**: Cross-encoder with MLP head (1536 → 768 → 384 → 1)
- **Max Sequence Length**: 4096 tokens
- **Training Data**: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
- **Task**: Binary classification for answer correctness prediction across domains
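As a rough size check on the head listed above, the arithmetic can be sketched in a few lines. This assumes each arrow in `1536 → 768 → 384 → 1` is a fully connected layer with a bias term and that the head has no other parameters (for example, no normalization layers); the card does not confirm either detail, and `mlp_param_count` is a hypothetical helper, not part of the released code.

```python
def mlp_param_count(dims, bias=True):
    """Count the parameters of a stack of dense layers.

    Assumption: every consecutive pair in `dims` is one fully connected
    layer (weights of shape fan_in x fan_out, plus an optional bias).
    """
    total = 0
    for fan_in, fan_out in zip(dims, dims[1:]):
        total += fan_in * fan_out  # weight matrix
        if bias:
            total += fan_out       # one bias per output unit
    return total


# Layer widths taken from the "Architecture" bullet above.
HEAD_DIMS = [1536, 768, 384, 1]

print(f"Head parameters (with biases): {mlp_param_count(HEAD_DIMS):,}")
```

Under those assumptions the head adds roughly 1.5M parameters, which is small next to the 1.5B-parameter base model.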
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage - works across math, science, and academic domains
instruction = "What is the derivative of f(x) = 3x² + 2x - 1?"
response = "Using the power rule: f'(x) = 6x + 2. The derivative of 3x² is 6x, the derivative of 2x is 2, and the derivative of -1 is 0."

# Tokenize input pair
inputs = tokenizer(
    instruction,
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

# Get correctness score
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
```
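One possible way to use the score is to rank several candidate responses to the same instruction and keep the highest-scoring one. The sketch below is a minimal illustration of that idea in plain Python: it takes raw logits (one per candidate, as produced by a model call like the one above), applies the same sigmoid and 0.5 threshold, and returns the top candidate. The helper names `sigmoid` and `pick_best` are hypothetical, and the logit values in the usage lines are made up for illustration rather than real model outputs.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def pick_best(candidates, logits, threshold=0.5):
    """Return (best_candidate, best_score, predicted_correct).

    `candidates` and `logits` must be the same length; each logit is the
    verifier's raw output for the matching candidate response.
    """
    if not candidates or len(candidates) != len(logits):
        raise ValueError("candidates and logits must be non-empty and aligned")
    scores = [sigmoid(x) for x in logits]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_idx]
    return candidates[best_idx], best_score, best_score > threshold


# Illustrative values only: these logits are invented, not model outputs.
candidates = ["f'(x) = 3x + 2", "f'(x) = 6x + 2", "f'(x) = 6x - 1"]
logits = [-1.2, 2.0, 0.3]

best, score, ok = pick_best(candidates, logits)
print(f"Best candidate: {best} (score {score:.3f}, predicted correct: {ok})")
```

Ties go to the earliest candidate, since `max` returns the first maximal index.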
## Training Details
This model was trained using the [Weaver distillation pipeline](https://github.com/HazyResearch/scaling-verification) on a combined dataset spanning multiple reasoning domains. For training your own distilled models, see the [distillation README](https://github.com/HazyResearch/scaling-verification/blob/main/distillation/README.md).
## Citation
```bibtex
@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
title={Shrinking the Generation-Verification Gap with Weak Verifiers},
author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
year={2025},
eprint={2506.18203},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2506.18203},
}
```