Spaces:
Running
Running
Update leaderboard/md.py
Browse files- leaderboard/md.py +74 -2
leaderboard/md.py
CHANGED
@@ -2,7 +2,79 @@ from datetime import datetime
|
|
2 |
|
3 |
import pytz
|
4 |
|
5 |
-
ABOUT_TEXT_V2 = """
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
6 |
|
7 |
ABOUT_TEXT_V1 = """
|
8 |
We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
|
@@ -106,7 +178,7 @@ TOP_TEXT = """# RewardBench: Evaluating Reward Models"""
|
|
106 |
|
107 |
CAPTION_V2 = f"""The *new version* of RewardBench that is based on unseen human data and designed to be substantially more difficult!
|
108 |
|
109 |
-
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper
|
110 |
|
111 |
CAPTION_V1 = f"""The original RewardBench -- the first reward model evaluation.
|
112 |
|
|
|
2 |
|
3 |
import pytz
|
4 |
|
5 |
+
ABOUT_TEXT_V2 = """
|
6 |
+
The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
|
7 |
+
1. **Factuality** (*NEW!*): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
|
8 |
+
2. **Precise Instruction Following** (*NEW!*): Tests the ability of RMs to judge whether text follows precise instructions, such as "Answer without the letter u".
|
9 |
+
3. **Math**: Tests RMs' abilities at math, on open-ended human prompts ranging from middle school physics and geometry to college-level chemistry, calculus, combinatorics, and more.
|
10 |
+
4. **Safety**: Tests RMs' abilities to correctly comply with or refuse prompts related to harmful use cases as well as general compliance behaviors.
|
11 |
+
5. **Focus**: Tests RMs' ability to detect high-quality, on-topic answers to general user queries.
|
12 |
+
6. **Ties** (*NEW*!): This new type of subset tests the robustness of RMs in domains with many possible similar answers. For example, the question "Name a color of the rainbow" has seven possible correct answers and infinitely many incorrect ones.
|
13 |
+
|
14 |
+
The RewardBench 2 leaderboard averages over these six subsets.
|
15 |
+
For the first five categories, the scoring for RewardBench 2 evaluates success as whether the score of a prompt-chosen pair is greater than the score of *three* prompt-rejected pairs.
|
16 |
+
The "Ties" score is a weighted score of accuracy (as measured by *all* valid correct answers being scored higher than *all* incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that of the highest and lowest-scored correct responses. This metric rewards not only correctness, but also a model's ability to prioritize correct answers over incorrect ones more strongly than it distinguishes between equally valid correct responses.
|
17 |
+
|
18 |
+
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/main-fig-hor.png" alt="RewardBench 2 Flow" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
|
19 |
+
|
20 |
+
## Dataset Construction Summary
|
21 |
+
| Domain | Count | Prompt Source | Method of generating completions | Completion Filtering |
|
22 |
+
|--------|-------|---------------|----------------------------------|---------------------|
|
23 |
+
| Factuality | 475 | Human | Both | Multi-LM-as-a-judge |
|
24 |
+
| Precise IF | 160 | Human | Natural | Verifier functions |
|
25 |
+
| Math | 183 | Human | Natural | Majority voting |
|
26 |
+
| Safety | 450 | CoCoNot | Both | LM-as-a-judge & rubrics |
|
27 |
+
| Focus | 495 | Human | System Prompt Variation | N/A |
|
28 |
+
| Ties | 102 | Manual | System Prompt Variation | Manual verification |
|
29 |
+
|
30 |
+
## Dataset Details
|
31 |
+
|
32 |
+
Each sample in the dataset has the following items.
|
33 |
+
Note, the dataset is single-turn:
|
34 |
+
* `prompt` (`str`): the instruction given in the various test sets.
|
35 |
+
* `chosen` (`list[str]`): the chosen response(s) (1 chosen response for all subsets but ties)
|
36 |
+
* `rejected` (`list[str]`): the rejected responses (3 chosen responses for all subsets but ties)
|
37 |
+
* `num_correct` (`int`): the number of chosen responses
|
38 |
+
* `num_rejected` (`int`): the number of rejected responses
|
39 |
+
* `total_completions` (`int`): the total number of responses
|
40 |
+
* `models` (`list[str]`): a list of models that the chosen and rejected responses are generated from, respectively
|
41 |
+
* `subset` (`str`): the subset the datapoint is part of.
|
42 |
+
* `id` (`int`): an incremented id for every prompt in the benchmark.
|
43 |
+
|
44 |
+
To select a specific subset use HuggingFace Datasets `.filter` functionality.
|
45 |
+
```
|
46 |
+
dataset = dataset.filter(lambda ex: ex["subset"] == "Factuality")
|
47 |
+
```
|
48 |
+
|
49 |
+
## Models Used
|
50 |
+
We generated completions from the following models:
|
51 |
+
- [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3) (Apache 2.0)
|
52 |
+
- [Tulu 3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) (Llama 3.1 Community License Agreement)
|
53 |
+
- [Tulu 3 70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) (Llama 3.1 Community License Agreement)
|
54 |
+
- [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (Llama 3.1 Community License Agreement)
|
55 |
+
- [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Llama 3.1 Community License Agreement)
|
56 |
+
- [Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) (Llama 3.2 Community License Agreement)
|
57 |
+
- [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Llama 2 Community License Agreement)
|
58 |
+
- [Tulu 2 70B](https://huggingface.co/allenai/tulu-2-dpo-70b) (Ai2 ImpACT Low Risk License)
|
59 |
+
- [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) (Qwen License Agreement)
|
60 |
+
- [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (Apache 2.0)
|
61 |
+
- [Qwen2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache 2.0)
|
62 |
+
- [Qwen2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (Apache 2.0)
|
63 |
+
- [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) (Qwen License Agreement)
|
64 |
+
- [Qwen2.5 Math 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) (Apache 2.0)
|
65 |
+
- [Deepseek Math 7B RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) (This model is licensed under the Deepseek License. Any use of the outputs from this model must be in accordance with the use restrictions in the [Deepseek License](https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL).)
|
66 |
+
- [OLMoE 1B 7B 0924 Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924) (Apache 2.0)
|
67 |
+
- [Dolphin 2.0 Mistral 7b](https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b) (Apache 2.0)
|
68 |
+
- [Zephyr 7b Beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) (MIT License)
|
69 |
+
- GPT-4o (Outputs produced by GPT-4 are subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use/))
|
70 |
+
- Claude 3.5 Sonnet (Outputs produced by Claude are subject to Anthropic [terms of service](https://www.anthropic.com/legal/consumer-terms) and [usage policy](https://www.anthropic.com/legal/aup))
|
71 |
+
|
72 |
+
## License
|
73 |
+
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third party models that are subject to separate terms governing their use.
|
74 |
+
|
75 |
+
## Trained Reward Models
|
76 |
+
We also trained and released several reward models— check out the [RewardBench 2 Collection](https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51) to use them!
|
77 |
+
"""
|
78 |
|
79 |
ABOUT_TEXT_V1 = """
|
80 |
We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
|
|
|
178 |
|
179 |
CAPTION_V2 = f"""The *new version* of RewardBench that is based on unseen human data and designed to be substantially more difficult!
|
180 |
|
181 |
+
[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper](https://github.com/allenai/reward-bench/blob/main/paper-v2.pdf) | Total models: {{}} | Last restart (PST): {current_time}"""
|
182 |
|
183 |
CAPTION_V1 = f"""The original RewardBench -- the first reward model evaluation.
|
184 |
|