hazyresearch
/

Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct

@@ -1,61 +1,110 @@
 ---
 license: mit
 ---
-# Weaver Distilled - All Datasets (gte-Qwen2-1.5B-instruct)
-This is a distilled cross-encoder model based on [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct), trained to predict the correctness of answers across multiple domains: [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa), and [MMLU Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro). This general-purpose verifier was trained on Weaver scores aggregated over 35 different verifiers and reward models.
 ## Model Details
-- **Base Model**: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)
 - **Architecture**: Cross-encoder with MLP head (1536 → 768 → 384 → 1)
-- **Max Sequence Length**: 4096
-- **Training Data**: Combined dataset of [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa), and [MMLU Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) from 35 different LM Judges and reward models aggregated with Weaver
-- **Training Objective**: Binary classification (correct/incorrect answer prediction)
-## Usage
-```python
-from custom_crossencoder import CustomCrossEncoder, TrainingConfig
-# Initialize model
-config = TrainingConfig(
-    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct",
-    max_length=4096,
-    mlp_hidden_dims=[1536, 768, 384]
-)
-model = CustomCrossEncoder(config)
-# Load checkpoint
-model.load_state_dict(torch.load("hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct"))
-model.eval()
-# Get prediction
-instruction = "Your instruction here"
-answer = "Your answer here"
-encoded = model.tokenizer(
-    text=instruction,
-    text_pair=answer,
     truncation=True,
     max_length=4096,
-    padding="max_length",
     return_tensors="pt"
 )
 with torch.no_grad():
-    prediction = model(encoded["input_ids"], encoded["attention_mask"])
 ```
-## Running Evaluation
-TODO: ADD EVALUATION_SIMPLE COMMAND HERE
-## License
-[Your chosen license]
-## Citation
-If you use this model in your research, please cite:
-TODO

 ---
 license: mit
+pipeline_tag: text-classification
+library_name: transformers
+base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
+tags:
+- math
+- science
+- academic
+- reasoning
+- verification
+- weaver
+- cross-encoder
+- multi-domain
+language:
+- en
 ---
+# Weaver Distilled for All Datasets (gte-Qwen2-1.5B-instruct)
+A general-purpose distilled cross-encoder model that captures 98.7% of Weaver's accuracy while reducing verification compute by 99.97%. This model is fine-tuned from gte-Qwen2-1.5B-instruct to predict the correctness of reasoning responses across multiple domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro).
 ## Model Details
+- **Base Model**: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (1.5B parameters)
 - **Architecture**: Cross-encoder with MLP head (1536 → 768 → 384 → 1)
+- **Max Sequence Length**: 4096 tokens
+- **Training Data**: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
+- **Task**: Binary classification for answer correctness prediction across domains
+## Performance
+Multi-domain performance with Llama 3.1 70B generations:
+<!-- TODO: Update with actual performance numbers -->
+- **Weaver (Full)**: XX.X% accuracy, high compute cost
+- **Weaver (Distilled)**: XX.X% accuracy, 99.97% compute reduction
+- **Majority Voting**: XX.X% accuracy
+## Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model_name = "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Example usage - works across math, science, and academic domains
+instruction = "What is the derivative of f(x) = 3x² + 2x - 1?"
+response = "Using the power rule: f'(x) = 6x + 2. The derivative of 3x² is 6x, the derivative of 2x is 2, and the derivative of -1 is 0."
+# Tokenize input pair
+inputs = tokenizer(
+    instruction,
+    response,
     truncation=True,
     max_length=4096,
+    padding=True,
     return_tensors="pt"
 )
+# Get correctness score
 with torch.no_grad():
+    outputs = model(**inputs)
+    score = torch.sigmoid(outputs.logits).item()
+print(f"Correctness score: {score:.3f}")
+print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
 ```
+## Training Details
+This model was trained using the [Weaver distillation pipeline](https://github.com/ScalingIntelligence/scaling-verification/tree/main/distillation) on a combined dataset spanning multiple reasoning domains. For training your own distilled models, see the [distillation README](https://github.com/ScalingIntelligence/scaling-verification/blob/main/distillation/README.md).
+## Evaluation
+Evaluate this model on different datasets:
+```bash
+# MATH500
+python evaluate_crossencoder.py \
+  --model_name "Alibaba-NLP/gte-Qwen2-1.5B-instruct" \
+  --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct" \
+  --dataset_path "hazyresearch/MATH500_with_Llama_3.1_70B_Instruct_v1" \
+  --dataset_split "data" \
+  --max_length 4096 \
+  --batch_size 64
+# GPQA
+python evaluate_crossencoder.py \
+  --model_name "Alibaba-NLP/gte-Qwen2-1.5B-instruct" \
+  --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct" \
+  --dataset_path "hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1" \
+  --dataset_split "data" \
+  --max_length 4096 \
+  --batch_size 64
+```
+## Citation
+```bibtex
+@article{weaver2025,
+  title={Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers},
+  author={},
+  journal={arXiv preprint},
+  year={2025}
+}
+```