natolambert committed on
Commit
bfbc587
·
verified ·
1 Parent(s): f0c8dc0

Update leaderboard/md.py

Files changed (1)
  1. leaderboard/md.py +74 -2
leaderboard/md.py CHANGED
@@ -2,7 +2,79 @@ from datetime import datetime
 
 import pytz
 
-ABOUT_TEXT_V2 = """TODO"""
+ ABOUT_TEXT_V2 = """
6
+ The RewardBench 2 evaluation dataset is the new version of RewardBench that is based on unseen human data and designed to be substantially more difficult! RewardBench 2 evaluates capabilities of reward models over the following categories:
7
+ 1. **Factuality** (*NEW!*): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
8
+ 2. **Precise Instruction Following** (*NEW!*): Tests the ability of RMs to judge whether text follows precise instructions, such as "Answer without the letter u".
9
+ 3. **Math**: Tests RMs' abilities at math, on open-ended human prompts ranging from middle school physics and geometry to college-level chemistry, calculus, combinatorics, and more.
10
+ 4. **Safety**: Tests RMs' abilities to correctly comply with or refuse prompts related to harmful use cases as well as general compliance behaviors.
11
+ 5. **Focus**: Tests RMs' ability to detect high-quality, on-topic answers to general user queries.
12
+ 6. **Ties** (*NEW*!): This new type of subset tests the robustness of RMs in domains with many possible similar answers. For example, the question "Name a color of the rainbow" has seven possible correct answers and infinitely many incorrect ones.
13
+
14
+ The RewardBench 2 leaderboard averages over these six subsets.
15
+ For the first five categories, the scoring for RewardBench 2 evaluates success as whether the score of a prompt-chosen pair is greater than the score of *three* prompt-rejected pairs.
16
+ The "Ties" score is a weighted score of accuracy (as measured by *all* valid correct answers being scored higher than *all* incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that of the highest and lowest-scored correct responses. This metric rewards not only correctness, but also a model's ability to prioritize correct answers over incorrect ones more strongly than it distinguishes between equally valid correct responses.
17
+
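+As a concrete illustration (not the evaluation code itself), here is a minimal Python sketch of the two rules above; `score` is a hypothetical stand-in for a reward model's scoring function, and the Ties margin weighting from the paper is omitted:
+
+```
+def best_of_4_correct(score, prompt, chosen, rejected):
+    # First five categories: the single chosen completion must
+    # outscore all three rejected completions.
+    chosen_score = score(prompt, chosen[0])
+    return all(chosen_score > score(prompt, r) for r in rejected)
+
+def ties_accurate(score, prompt, chosen, rejected):
+    # Ties accuracy component: every valid correct answer must
+    # outscore every incorrect answer.
+    chosen_scores = [score(prompt, c) for c in chosen]
+    rejected_scores = [score(prompt, r) for r in rejected]
+    return min(chosen_scores) > max(rejected_scores)
+```
+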
+<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/main-fig-hor.png" alt="RewardBench 2 Flow" width="800" style="margin-left:auto; margin-right:auto; display:block;"/>
+
+## Dataset Construction Summary
+| Domain | Count | Prompt Source | Method of generating completions | Completion Filtering |
+|--------|-------|---------------|----------------------------------|---------------------|
+| Factuality | 475 | Human | Both | Multi-LM-as-a-judge |
+| Precise IF | 160 | Human | Natural | Verifier functions |
+| Math | 183 | Human | Natural | Majority voting |
+| Safety | 450 | CoCoNot | Both | LM-as-a-judge & rubrics |
+| Focus | 495 | Human | System Prompt Variation | N/A |
+| Ties | 102 | Manual | System Prompt Variation | Manual verification |
+
+## Dataset Details
+
+Each sample in the dataset has the following items.
+Note that the dataset is single-turn:
+* `prompt` (`str`): the instruction given in the various test sets.
+* `chosen` (`list[str]`): the chosen response(s) (one chosen response for all subsets except Ties)
+* `rejected` (`list[str]`): the rejected responses (three rejected responses for all subsets except Ties)
+* `num_correct` (`int`): the number of chosen responses
+* `num_rejected` (`int`): the number of rejected responses
+* `total_completions` (`int`): the total number of responses
+* `models` (`list[str]`): the models that generated the chosen and rejected responses, respectively
+* `subset` (`str`): the subset the datapoint is part of.
+* `id` (`int`): an incrementing id for every prompt in the benchmark.
+
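+For a quick look at these fields, the following sketch loads the dataset with Hugging Face Datasets and prints one example; it selects whichever split the dataset ships with rather than assuming a split name:
+
+```
+from datasets import load_dataset
+
+ds = load_dataset("allenai/reward-bench-2")
+split = next(iter(ds))  # use the first available split
+ex = ds[split][0]
+print(ex["subset"], ex["num_correct"], ex["num_rejected"], ex["total_completions"])
+print(ex["prompt"])
+```
+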
+To select a specific subset, use the Hugging Face Datasets `.filter` functionality:
+```
+from datasets import load_dataset
+
+dataset = load_dataset("allenai/reward-bench-2")
+dataset = dataset.filter(lambda ex: ex["subset"] == "Factuality")
+```
+
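+Putting the pieces together, here is a sketch of scoring one subset end to end with the best-of-4 rule; `my_reward_model` is a hypothetical stand-in for your scoring function:
+
+```
+from datasets import load_dataset
+
+ds = load_dataset("allenai/reward-bench-2")
+split = next(iter(ds))  # use the first available split
+subset = ds[split].filter(lambda ex: ex["subset"] == "Factuality")
+
+def correct(ex):
+    # Best-of-4 rule: the chosen completion must outscore all three rejected ones.
+    c = my_reward_model(ex["prompt"], ex["chosen"][0])
+    return all(c > my_reward_model(ex["prompt"], r) for r in ex["rejected"])
+
+accuracy = sum(correct(ex) for ex in subset) / len(subset)
+print(f"Factuality best-of-4 accuracy: {accuracy:.3f}")
+```
+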
+## Models Used
+We generated completions from the following models:
+- [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) (Apache 2.0)
+- [Tulu 3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) (Llama 3.1 Community License Agreement)
+- [Tulu 3 70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) (Llama 3.1 Community License Agreement)
+- [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (Llama 3.1 Community License Agreement)
+- [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Llama 3.1 Community License Agreement)
+- [Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) (Llama 3.2 Community License Agreement)
+- [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Llama 2 Community License Agreement)
+- [Tulu 2 70B](https://huggingface.co/allenai/tulu-2-dpo-70b) (Ai2 ImpACT Low Risk License)
+- [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) (Qwen License Agreement)
+- [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (Apache 2.0)
+- [Qwen2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache 2.0)
+- [Qwen2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (Apache 2.0)
+- [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) (Qwen License Agreement)
+- [Qwen2.5 Math 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) (Apache 2.0)
+- [Deepseek Math 7B RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) (This model is licensed under the Deepseek License. Any use of the outputs from this model must be in accordance with the use restrictions in the [Deepseek License](https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL).)
+- [OLMoE 1B 7B 0924 Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct) (Apache 2.0)
+- [Dolphin 2.0 Mistral 7b](https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b) (Apache 2.0)
+- [Zephyr 7b Beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) (MIT License)
+- GPT-4o (Outputs produced by GPT-4o are subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use/))
+- Claude 3.5 Sonnet (Outputs produced by Claude are subject to Anthropic's [terms of service](https://www.anthropic.com/legal/consumer-terms) and [usage policy](https://www.anthropic.com/legal/aup))
+
+## License
+This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third-party models that are subject to separate terms governing their use.
+
+## Trained Reward Models
+We also trained and released several reward models -- check out the [RewardBench 2 Collection](https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51) to use them!
+"""
 
 ABOUT_TEXT_V1 = """
 We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
@@ -106,7 +178,7 @@ TOP_TEXT = """# RewardBench: Evaluating Reward Models"""
 
 CAPTION_V2 = f"""The *new version* of RewardBench that is based on unseen human data and designed to be substantially more difficult!
 
-[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper (TODO)](TODO) | Total models: {{}} | Last restart (PST): {current_time}"""
+[Code](https://github.com/allenai/reward-bench) | [Eval. Dataset v2](https://huggingface.co/datasets/allenai/reward-bench-2) | [Results v2](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper](https://github.com/allenai/reward-bench/blob/main/paper-v2.pdf) | Total models: {{}} | Last restart (PST): {current_time}"""
 
 CAPTION_V1 = f"""The original RewardBench -- the first reward model evaluation.
 