DeepSeek-GRM-27B
Introduction
In this work, we present DeepSeek-GRM, a Generative Reward Model (GRM) empowered by Self-Principled Critique Tuning (SPCT). For any query-and-response pair(s), the model first writes its own evaluation principles, then produces a free-form critique, and finally emits discrete 1-to-10 scores, forming a transparent “principle → critique → score” pipeline that can be inspected or sampled multiple times at inference time for more granular judgments. This generative formulation improves reward quality across a broad range of RM tasks and can benefit applications such as evaluation, offline data collection, and online reinforcement learning (RL) for LLMs.
SPCT is a two-stage recipe that combines rejective fine-tuning with rule-based online RL. Extensive experiments show that DeepSeek-GRM significantly improves RM performance over existing RM approaches on benchmarks including RewardBench, PPE, RMB, and ReaLMistake; when 32 samples are drawn in parallel and filtered by the Meta RM, it surpasses much larger GRMs through inference-time scaling alone. These findings make DeepSeek-GRM a promising candidate for local deployment, especially the 16B and 27B variants. For production use, we recommend DeepSeek-V3, which is superior due to its larger scale and more up-to-date training data.
Model Summary
Base LLM: Two variants of DeepSeek-GRM are available: DeepSeek-GRM-16B is based on DeepSeek-V2-Lite, and DeepSeek-GRM-27B is based on Gemma-2-27B. For the 671B model, please refer to DeepSeek-V3.
Training Recipe: The model follows the pointwise GRM approach and is trained under the Self-Principled Critique Tuning (SPCT) framework, which consists of two stages:
- Rejective Fine-Tuning: The model is fine-tuned on data sampled from DeepSeek-V2.5-0906, comprising 1070K general instruction examples and 186K rejectively sampled examples.
- Rule-Based RL: The model is further trained with GRPO on 237K samples, using rule-based rewards that encourage it to produce high-quality principles and critiques (a sketch of such a reward rule is given below).
In practice, we observe that the trained model generates principles and critiques adaptively: for simple queries with a small number of responses (e.g., 2), it tends to produce more concise and fixed principles, whereas for more complex queries or a larger number of responses (e.g., 4 or more), it generates more detailed and flexible principles with longer critiques.
The open-source datasets used include MATH, UltraFeedback, OffsetBias, Skywork-Reward-Preference-80K-v0.2, and HelpSteer2-Preference. Training took ~35 hours on 128 A100 GPUs. Please refer to the paper for more details.
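As a rough illustration of the rule-based reward used in the Rule-Based RL stage, the sketch below checks whether a sampled judgement ranks the known best response strictly highest. This is a minimal sketch under that assumption; the helper name `rule_based_reward` and the exact +1/-1 values are illustrative, and the precise rule is described in the paper.

```python
def rule_based_reward(predicted_scores: list[float], best_index: int) -> float:
    """Illustrative reward rule: +1 if the critique's extracted pointwise scores
    rank the ground-truth best response strictly highest, otherwise -1."""
    if not predicted_scores:  # unparsable judgement
        return -1.0
    top = max(predicted_scores)
    # Require a unique argmax so ties are not rewarded.
    if predicted_scores.count(top) == 1 and predicted_scores.index(top) == best_index:
        return 1.0
    return -1.0
```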
Intended Use:
- RL with Non-Verifiable Tasks – plug-and-play reward for GRPO-based RL on LLMs (see the sketch after this list). In our experiments, DeepSeek-GRM shows a performance drop on reasoning-intensive queries and on tasks that rely heavily on external factual knowledge; a rule-based verifier is a good option to mitigate this issue.
- LLM Evaluation/Data Collection – automatic judge for LLM outputs with or without reference answers or pre-defined principles.
- Further Research – study self-principle-based reward shaping and inference-time scaling for RMs.
Not for: final medical/legal decisions or other high-risk scenarios without human review.
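The sketch below shows one way the generated scores could feed a GRPO-style update as group-normalized advantages. It is a minimal sketch rather than our training code: `grm_score` is a hypothetical wrapper around the reward-generation recipe in the Quick-Start section below.

```python
import statistics

def grm_score(query: str, response: str) -> float:
    """Hypothetical wrapper: run DeepSeek-GRM on (query, [response]) as in the
    Quick-Start examples and return the extracted 1-10 score."""
    raise NotImplementedError

def grpo_advantages(query: str, rollouts: list[str]) -> list[float]:
    # Score every rollout in the group with the GRM, then normalize within
    # the group to obtain the group-relative advantages used by GRPO.
    rewards = [grm_score(query, r) for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```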
Inference-Time Scaling Strategy: Due to the randomness in reward generation, we recommend sampling multiple judgement trajectories and aggregating them with a voting mechanism, which often leads to better reward quality.
- Parallel Sampling – Draw $k$ independent “principle → critique → score” trajectories for each evaluation item. Each item consists of a query and a set of responses to be evaluated, and each response is graded with a discrete score from 1 to 10.
- Voting / Aggregation – Aggregate the trajectories' discrete 1-10 scores by summing, for each response, the scores it receives under the different judging principles (see the sketch after this list). This typically leads to more robust and accurate final rewards.
- Meta RM Filtering – The Meta RM is a classifier that predicts the probability that a judgement is correct. It can be used to filter out low-quality judgements before voting, which further improves the final reward quality.
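A minimal sketch of the aggregation arithmetic, assuming the per-trajectory score lists (and optionally their Meta RM probabilities) have already been collected; complete, runnable examples are given in the Quick-Start section below.

```python
def vote(score_lists: list[list[float]],
         meta_probs: list[float] | None = None,
         keep_ratio: float = 0.5) -> list[float]:
    # Optionally keep only the trajectories the Meta RM deems most reliable.
    if meta_probs is not None:
        ranked = sorted(zip(meta_probs, score_lists), key=lambda p: p[0], reverse=True)
        score_lists = [s for _, s in ranked[:max(1, int(len(ranked) * keep_ratio))]]
    # Sum each response's scores over the surviving trajectories.
    return [sum(scores) for scores in zip(*score_lists)]

# Two trajectories scoring three responses: response 3 wins with 18 points.
print(vote([[7.0, 5.0, 9.0], [8.0, 4.0, 9.0]]))  # -> [15.0, 9.0, 18.0]
```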
Evaluation Results
Comparison with Existing RM Approaches
Model | Reward Bench | PPE Pref. | PPE Corr. | RMB | Overall |
---|---|---|---|---|---|
LLM-as-a-Judge | 83.4 | 64.2 | 58.8 | 64.8 | 67.8 |
DeepSeek-BTRM-27B | 81.7 | 68.3 | 66.7 | 57.9 | 68.6 |
CLoud-Gemma-2-27B | 82.0 | 67.1 | 62.4 | 63.4 | 68.7 |
DeepSeek-PairRM-27B | 87.1 | 65.8 | 64.8 | 58.2 | 69.0 |
DeepSeek-GRM-27B (Ours) | 86.0 | 64.7 | 59.8 | 69.0 | 69.9 |
Note: the LLM-as-a-Judge baseline also follows the SPCT training procedure with principle generation.
Ablation Study for SPCT
Method | Overall |
---|---|
Results of Greedy Decoding | |
DeepSeek-GRM-27B | 69.9 |
w/o Principle Generation | 67.5 |
w/o Rejective Sampling | 68.7 |
DeepSeek-GRM-27B-RFT | 68.8 |
w/o Rejective Sampled Data | 66.1 |
w/o General Instruction Data | 63.3 |
Results of Inference-Time Scaling (Voting @ 8) | |
DeepSeek-GRM-27B | 70.6 |
w/o Principle Generation | 68.0 |
Inference-Time Scaling Performance
Setting | Reward Bench | PPE Pref. | PPE Corr. | RMB | Overall |
---|---|---|---|---|---|
DeepSeek-GRM-27B | |||||
+ Greedy | 86.0 | 64.7 | 59.8 | 69.0 | 69.9 |
+ Vote @ 8 | 87.7 | 64.9 | 60.3 | 69.5 | 70.6 |
+ Vote @ 32 | 88.5 | 65.3 | 60.4 | 69.7 | 71.0 |
+ Vote @ 32 w/ MetaRM | 90.4 | 67.2 | 63.2 | 70.3 | 72.8 |
Quick-Start
Reward Generation
For single-turn queries,
import re import torch from transformers import pipeline pipe = pipeline( "text-generation", model="BBQGOD/DeepSeek-GRM-27B", model_kwargs={"torch_dtype": torch.bfloat16}, device="cuda", # replace with "mps" to run on a Mac device ) def extract_last_floats(text: str) -> list[float]: pattern = re.compile( r'(?:\\{1,2}boxed\{|\[)' r'\s*([^\]\}]+?)\s*' r'(?:\}|\])' ) matches = list(pattern.finditer(text)) if not matches: return [] last_content = matches[-1].group(1) parts = re.split(r'\s*,\s*', last_content.strip()) floats = [] for p in parts: try: floats.append(float(p)) except ValueError: pass return floats messages = [{ "role": "user", "content": "You are a skilled little expert at scoring responses. You should evaluate given responses based on the given judging criteria.\nGiven the context of the conversation (the last round is the User's query) and multiple responses from the Assistant, you need to refer to the [General Evaluation Criteria] to score the responses. Based on the general evaluation criteria, state potential other specific criteria to the query, the weights of different criteria, and then provide an overall comprehensive score upon them.\nEach score is an integer between 1 and 10, with a higher score indicating that the response meets the relevant criteria more closely. For example, a score of 1 means the response does not meet the criteria at all, a score of 6 means the response meets only some parts, and a score of 10 means the response perfectly meets the evaluation criteria.\nBefore scoring, please analyze step by step. Your scoring needs to be as strict as possible.\n#### Evaluation Criteria ####\n1. Instruction Adherence:\n - Fully Adhered (9-10 points): The response fully complies with all instructions and requirements of the question.\n - Partially Adhered (6-8 points): The response meets most of the instructions but has some omissions or misunderstandings.\n - Basically Adhered (3-5 points): The response meets some instructions, but the main requirements are not fulfilled.\n - Not Adhered (1-2 points): The response does not meet any instructions.\n Example: If the question requires three examples and the response provides only one, it falls under \"Partially Adhered.\"\n2. Usefulness:\n - Highly Useful (9-10 points): The response provides comprehensive and accurate information, fully addressing the issue.\n - Useful but Incomplete (6-8 points): The response provides some useful information, but lacks details or accuracy.\n - Limited Usefulness (3-5 points): The response offers little useful information, with most content being irrelevant or incorrect.\n - Useless or Incorrect (1-2 points): The response is completely irrelevant or incorrect.\n Example: If there are factual errors in the response but the overall direction is correct, it falls under \"Useful but Incomplete.\"\n3. Level of Detail:\n - Very Detailed (9-10 points): The response includes ample details covering all aspects of the issue.\n - Detailed but Slightly Lacking (6-8 points): The response is fairly detailed but misses some important details.\n - Basically Detailed (3-5 points): The response provides some details but is not thorough enough overall.\n - Not Detailed (1-2 points): The response is very brief and lacks necessary details.\n Example: If the response provides only a simple conclusion without an explanation, it falls under \"Not Detailed.\"\n4. 
Relevance:\n - Highly Relevant (9-10 points): The response is highly relevant to the question, with information closely aligned with the topic.\n - Generally Relevant (6-8 points): The response is generally relevant but includes some unnecessary information.\n - Partially Relevant (3-5 points): The response has a lot of content that deviates from the topic.\n - Not Relevant (1-2 points): The response is completely irrelevant.\n Example: If the response strays from the topic but still provides some relevant information, it falls under \"Partially Relevant.\"\n\n#### Conversation Context ####\n{}\n#### Responses to be Scored ####\n{}\n\n#### Output Format Requirements ####\n\nOutput with three lines\nSpecific Criteria: <Other potential criteria specific to the query and the context, and the weights of each criteria>.\nAnalysis: <Compare different responses based on given Criteria>.\nScores: <the overall comprehensive score of all resposnes in order, seperate by comma in the boxed, e.g., \\boxed{{x, x}} if there exists 2 responeses>." }] single_turn_query = "You are an expert in the uses and creation of spreadsheets in Microsoft Excel. Create a formula that will look for the same entry in cell CB5 from columns D and E and then return all entries from the corresponding rows of these entries from Q5 to T5 and return the last 6 entries in Q5 to T500 by taking the date from cell CK4 and looking up column B" single_turn_responses = [ "Unfortunately I do not have access to the spreadsheet data you referenced, but I can provide an example formula that attempts to accomplish the described logic:\n\n```\n=IF(OR(D5=CB5,E5=CB5),INDEX(Q5:T5,MATCH(CB5,{D5:E5},0)),INDEX(Q5:T500,MATCH(MAX(IF(B5:B500<=CK4,B5:B500)),B5:B500,0)-5))\n```\n\nBreakdown:\n\n- The IF statement checks if CB5 matches either D5 or E5 using OR. \n- If there is a match, it returns the range Q5:T5 for that row using INDEX and MATCH.\n- If no match, it returns the last 6 rows of Q:T before the date in CK4. It does this by finding the MAX date less than CK4 using MAX and IF, then matching that date using MATCH and returning 6 rows above it using INDEX.\n\nLet me know if you need any clarification or have additional requirements!", # Best Response "To achieve the desired outcome, you can use the following formula in cell T501:\n```\n=INDEX(OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6),1)\n```\nThis formula uses the `INDEX` and `OFFSET` functions to retrieve the last 6 entries from columns Q5 to T5 starting from the row with the maximum value in either column D or E. Here's how the formula works:\n\n1. `MATCH(MAX(D$5:E$5),D$5:E$5,0)`: This returns the row number of the cell with the maximum value in columns D5 to E5.\n2. `OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6)`: This retrieves a range of 6 cells starting from the row returned by the above formula in columns Q5 to T5.\n3. `INDEX(OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6),1)`: This returns the value in the first row of the range retrieved in step 2.\n\nFinally, to link the date in cell CK4 to the first entry in this range, you can use the following formula in cell T500:\n```\n=IF(T501=CK4,T501, \"\")\n```\nThis formula checks whether the value in cell T501 is the same as the date in cell CK4. If they are the same, it returns the value in cell T501. 
Otherwise, it returns an empty string.", " To create a formula that does the task you described, you can use the following formula in cell B500:\n\n=JOIN(\" \",IF(ISNUMBER(MATCH(CB5,D:E,0)),D5:E5,\"\"))\n\nHere's how the formula works:\n* The JOIN function is used to combine the rows of data from columns D and E where the value in cell CB5 exists.\n* The IF function is used to check if the value in cell CB5 exists in columns D and E. If it does, then the formula returns the values in columns D and E for those rows. If it doesn't, then the formula returns an empty string.\n* The MATCH function is used to find the row number where the value in cell CB5 exists in columns D and E. If the value is found, then the row number is returned as an integer. If the value is not found, then the function returns #N/A.\n* The D5:E5 and E5:E5 ranges are used to specify the cells to look up in columns D and E.\n* The last 6 entries in Q5 to T500 are selected by using the following formula in cell B500:\n=JOIN(\" \",IF(ISNUMBER(MATCH(CB5,D:E,0)),D5:E5, \"\"))\n\nThis formula will look for the same entry in cell CB5 in columns D and E, and then return all entries from the corresponding rows of these columns from Q5 to T5. It will also return the last 6 entries in Q5 to T500 by taking the date from cell CK4 and looking up column B.\nNote: The above formula assumes that the data is sorted in ascending order in columns D and E based on the date in cell CK4. If the data is not sorted, you may need to adjust the formula accordingly.", ] def assemble_responses(resp_list): return "".join([f"[The Begin of Response {i+1}]\n{resp}\n[The End of Response {i+1}]\n\n" for i, resp in enumerate(resp_list)]) messages[0]["content"] = messages[0]["content"].format(single_turn_query, assemble_responses(single_turn_responses)) outputs = pipe(messages, max_new_tokens=2048, temperature=1.0, do_sample=True) judgement = outputs[0]["generated_text"][-1]["content"].strip() print(judgement) # DeepSeek-GRM tends to generate direct and fixed set of principles when facing simple queries. rewards = extract_last_floats(judgement) print(rewards)
For multi-turn queries,
import re import torch from transformers import pipeline pipe = pipeline( "text-generation", model="BBQGOD/DeepSeek-GRM-27B", model_kwargs={"torch_dtype": torch.bfloat16}, device="cuda", # replace with "mps" to run on a Mac device ) def extract_last_floats(text: str) -> list[float]: pattern = re.compile( r'(?:\\{1,2}boxed\{|\[)' r'\s*([^\]\}]+?)\s*' r'(?:\}|\])' ) matches = list(pattern.finditer(text)) if not matches: return [] last_content = matches[-1].group(1) parts = re.split(r'\s*,\s*', last_content.strip()) floats = [] for p in parts: try: floats.append(float(p)) except ValueError: pass return floats messages = [{ "role": "user", "content": "You are a skilled little expert at scoring responses. You should evaluate given responses based on the given judging criteria.\nGiven the context of the conversation (the last round is the User's query) and multiple responses from the Assistant, you need to refer to the [General Evaluation Criteria] to score the responses. Based on the general evaluation criteria, state potential other specific criteria to the query, the weights of different criteria, and then provide an overall comprehensive score upon them.\nEach score is an integer between 1 and 10, with a higher score indicating that the response meets the relevant criteria more closely. For example, a score of 1 means the response does not meet the criteria at all, a score of 6 means the response meets only some parts, and a score of 10 means the response perfectly meets the evaluation criteria.\nBefore scoring, please analyze step by step. Your scoring needs to be as strict as possible.\n#### Evaluation Criteria ####\n1. Instruction Adherence:\n - Fully Adhered (9-10 points): The response fully complies with all instructions and requirements of the question.\n - Partially Adhered (6-8 points): The response meets most of the instructions but has some omissions or misunderstandings.\n - Basically Adhered (3-5 points): The response meets some instructions, but the main requirements are not fulfilled.\n - Not Adhered (1-2 points): The response does not meet any instructions.\n Example: If the question requires three examples and the response provides only one, it falls under \"Partially Adhered.\"\n2. Usefulness:\n - Highly Useful (9-10 points): The response provides comprehensive and accurate information, fully addressing the issue.\n - Useful but Incomplete (6-8 points): The response provides some useful information, but lacks details or accuracy.\n - Limited Usefulness (3-5 points): The response offers little useful information, with most content being irrelevant or incorrect.\n - Useless or Incorrect (1-2 points): The response is completely irrelevant or incorrect.\n Example: If there are factual errors in the response but the overall direction is correct, it falls under \"Useful but Incomplete.\"\n3. Level of Detail:\n - Very Detailed (9-10 points): The response includes ample details covering all aspects of the issue.\n - Detailed but Slightly Lacking (6-8 points): The response is fairly detailed but misses some important details.\n - Basically Detailed (3-5 points): The response provides some details but is not thorough enough overall.\n - Not Detailed (1-2 points): The response is very brief and lacks necessary details.\n Example: If the response provides only a simple conclusion without an explanation, it falls under \"Not Detailed.\"\n4. 
Relevance:\n - Highly Relevant (9-10 points): The response is highly relevant to the question, with information closely aligned with the topic.\n - Generally Relevant (6-8 points): The response is generally relevant but includes some unnecessary information.\n - Partially Relevant (3-5 points): The response has a lot of content that deviates from the topic.\n - Not Relevant (1-2 points): The response is completely irrelevant.\n Example: If the response strays from the topic but still provides some relevant information, it falls under \"Partially Relevant.\"\n\n#### Conversation Context ####\n{}\n#### Responses to be Scored ####\n{}\n\n#### Output Format Requirements ####\n\nOutput with three lines\nSpecific Criteria: <Other potential criteria specific to the query and the context, and the weights of each criteria>.\nAnalysis: <Compare different responses based on given Criteria>.\nScores: <the overall comprehensive score of all resposnes in order, seperate by comma in the boxed, e.g., \\boxed{{x, x}} if there exists 2 responeses>." }] multi_turn_query = [ { "role": "user", "content": "Argument against \"equal pay for equal work enforcement\":\n[2:09, 2/5/2023] Enrique: Let's suppose women are less productive than men, then you are making unproductive women having a hard time finding jobs. If they aren't less productive, then there is a cost imposed to employers who turn them down for no good reason. But if you demand equal pay for equal work then they can't bear this cost. \n[2:10, 2/5/2023] Enrique: So what now? you will force employers to employ an unproductive person? \n" }, { "role": "assistant", "content": "Counterargument: \n[2:13, 2/5/2023] Response: First and foremost, the premise that women are less productive than men is a baseless stereotype that reinforces gender discrimination. There is no scientific evidence to support this claim, so it is unjust to assume that women are inherently less productive when considering equal pay for equal work.\n\nEnforcing equal pay for equal work is not about forcing employers to employ unproductive individuals; it's about eliminating discrimination and ensuring that all employees are paid fairly based on their experience, qualifications, and job performance. This principle applies to all workers, irrespective of their gender.\n\nIf the concern is about hiring unproductive individuals, then employers should focus on establishing clear performance metrics and using those as a basis for compensation rather than making assumptions based on gender. By evaluating all employees based on their performance, employers can ensure they are paying for actual productivity and not perpetuating gender-based wage disparities.\n\nFurthermore, equal pay for equal work enforcement does not mean employers cannot fire or let go of unproductive employees. It only requires that while they are employed, they should receive wages commensurate with their job responsibilities and performance, regardless of gender.\n\nUltimately, enforcing equal pay for equal work ensures that employees are compensated fairly, which is not only a basic human right but also a way to create a more inclusive and equitable work environment." }, { "role": "user", "content": "i did not want a counterargument. it's an argument from Friedman, and he wasn't sexist." } ] def assemble_query(message_list): assert message_list[-1]["role"] == "user", "The last message must be from the user." 
return "\n\n".join([f"{msg['role'].capitalize()}: {msg['content']}" for msg in message_list]) multi_turn_query = assemble_query(multi_turn_query) multi_turn_responses = [ "I apologize for misunderstanding your request and for incorrectly attributing bias to Friedman's argument. You're right that this is an economic argument from Milton Friedman, not a sexist statement. Let me summarize the argument more accurately:\n\nFriedman's argument against enforcing \"equal pay for equal work\":\n\n1. If there truly is a productivity difference between groups, enforcing equal pay could make it harder for less productive workers to find jobs, as employers would be forced to pay them more than their productivity warrants.\n\n2. If there isn't a real productivity difference, discriminating employers are already bearing a cost by turning away qualified workers. Enforcing equal pay removes their ability to bear this cost, potentially leading to other forms of discrimination.\n\n3. The policy creates a dilemma: either it forces employers to hire potentially less productive workers at higher wages, or it attempts to force hiring decisions, both of which interfere with market mechanisms.\n\n4. The argument suggests that market forces, rather than legislation, would be more effective at addressing wage disparities over time.\n\nThis economic perspective focuses on potential unintended consequences of the policy, without making claims about inherent abilities of different groups.", # Best Response " My apologies for misunderstanding your request. Here's a revised response that addresses Milton Friedman's argument without presenting a counterargument:\n\nMilton Friedman's argument that equal pay for equal work enforcement can lead to unproductive women having a hard time finding jobs is based on the assumption that women are inherently less productive than men. However, this assumption is not supported by empirical evidence. Numerous studies have shown that women can be just as productive as men when given equal opportunities and treated fairly.\n\nFriedman's argument also ignores the fact that gender discrimination is a real and persistent issue in the workplace. Women have historically been paid less than men for the same work, and this wage gap persists today. Enforcing equal pay for equal work is a way to address this issue and ensure that women are not unfairly penalized for their gender.\n\nFurthermore, Friedman's argument assumes that employers are able to accurately measure productivity and that they will always make hiring decisions based on merit rather than gender. However, research has shown that biases and stereotypes can influence hiring decisions, and women may be less likely to be hired or promoted into higher-paying positions. Enforcing equal pay for equal work can help to mitigate these biases and ensure that women have equal opportunities in the workplace.\n\nIn summary, while Milton Friedman's argument against equal pay for equal work enforcement may have been well-intentioned, it is based on flawed assumptions and ignores the realities of gender discrimination in the workplace. Enforcing equal pay for equal work is an important step towards creating a fair and equitable work environment for all employees, regardless of gender.", "My apologies for providing a counterargument. Here is the argument from Friedman, without the sexist premise:\n\n**Argument against \"equal pay for equal work enforcement\":**\n\nLet's suppose that women are equally productive as men. 
If you demand equal pay for equal work, then employers will have to pay women the same as men, even if they are less productive. This will impose a cost on employers, who will then be less likely to hire women. As a result, unproductive women will have a harder time finding jobs.\n\n**Friedman's argument is based on the following assumptions:**\n\n* Women are less productive than men.\n* Employers will not hire women if they have to pay them the same as men.\n* Unproductive women will not be able to find jobs if they are not paid the same as men.\n\n**However, there is no evidence to support these assumptions.**\n\n* Studies have shown that women are just as productive as men.\n* Many employers are willing to hire women, even if they have to pay them the same as men.\n* Unproductive women can still find jobs, even if they are not paid the same as men.\n\n**Therefore, Friedman's argument against equal pay for equal work enforcement is flawed.**\n\nIt is important to note that Friedman was not sexist. He believed that women should have the same opportunities as men. However, he also believed that the free market should be allowed to operate without government interference. He argued that equal pay for equal work laws would interfere with the free market and would ultimately harm women.\n\n**However, there is a growing body of evidence to suggest that equal pay for equal work laws actually benefit women.** These laws help to close the gender pay gap and make it easier for women to find jobs. They also help to create a more level playing field for women in the workplace.", ] def assemble_responses(resp_list): return "".join([f"[The Begin of Response {i+1}]\n{resp}\n[The End of Response {i+1}]\n\n" for i, resp in enumerate(resp_list)]) messages[0]["content"] = messages[0]["content"].format(multi_turn_query, assemble_responses(multi_turn_responses)) outputs = pipe(messages, max_new_tokens=2048, temperature=1.0, do_sample=True) judgement = outputs[0]["generated_text"][-1]["content"].strip() print(judgement) # DeepSeek-GRM tends to generate more flexible principles when facing complex queries and larger amount of responses. rewards = extract_last_floats(judgement) print(rewards)
Reward Generation with Pre-defined Principles
In addition to generating the whole trajectory, including principles and critiques, from scratch, DeepSeek-GRM can also generate rewards from pre-defined principles. This is useful when you want to evaluate model outputs against a fixed set of judging principles across different queries.
import re
import torch
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="BBQGOD/DeepSeek-GRM-27B",
model_kwargs={"torch_dtype": torch.bfloat16},
device="cuda", # replace with "mps" to run on a Mac device
)
def extract_last_floats(text: str) -> list[float]:
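# Grab the last \boxed{...} or [...] group in the judgement and parse its comma-separated scores as floats.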
pattern = re.compile(
r'(?:\\{1,2}boxed\{|\[)'
r'\s*([^\]\}]+?)\s*'
r'(?:\}|\])'
)
matches = list(pattern.finditer(text))
if not matches:
return []
last_content = matches[-1].group(1)
parts = re.split(r'\s*,\s*', last_content.strip())
floats = []
for p in parts:
try:
floats.append(float(p))
except ValueError:
pass
return floats
messages = [{
"role": "user",
"content": "You are a skilled little expert at scoring responses. You should evaluate given responses based on the given judging criteria.\nGiven the context of the conversation (the last round is the User's query) and multiple responses from the Assistant, you need to refer to the [General Evaluation Criteria] to score the responses. Based on the general evaluation criteria, state potential other specific criteria to the query, the weights of different criteria, and then provide an overall comprehensive score upon them.\nEach score is an integer between 1 and 10, with a higher score indicating that the response meets the relevant criteria more closely. For example, a score of 1 means the response does not meet the criteria at all, a score of 6 means the response meets only some parts, and a score of 10 means the response perfectly meets the evaluation criteria.\nBefore scoring, please analyze step by step. Your scoring needs to be as strict as possible.\n#### Evaluation Criteria ####\n1. Instruction Adherence:\n - Fully Adhered (9-10 points): The response fully complies with all instructions and requirements of the question.\n - Partially Adhered (6-8 points): The response meets most of the instructions but has some omissions or misunderstandings.\n - Basically Adhered (3-5 points): The response meets some instructions, but the main requirements are not fulfilled.\n - Not Adhered (1-2 points): The response does not meet any instructions.\n Example: If the question requires three examples and the response provides only one, it falls under \"Partially Adhered.\"\n2. Usefulness:\n - Highly Useful (9-10 points): The response provides comprehensive and accurate information, fully addressing the issue.\n - Useful but Incomplete (6-8 points): The response provides some useful information, but lacks details or accuracy.\n - Limited Usefulness (3-5 points): The response offers little useful information, with most content being irrelevant or incorrect.\n - Useless or Incorrect (1-2 points): The response is completely irrelevant or incorrect.\n Example: If there are factual errors in the response but the overall direction is correct, it falls under \"Useful but Incomplete.\"\n3. Level of Detail:\n - Very Detailed (9-10 points): The response includes ample details covering all aspects of the issue.\n - Detailed but Slightly Lacking (6-8 points): The response is fairly detailed but misses some important details.\n - Basically Detailed (3-5 points): The response provides some details but is not thorough enough overall.\n - Not Detailed (1-2 points): The response is very brief and lacks necessary details.\n Example: If the response provides only a simple conclusion without an explanation, it falls under \"Not Detailed.\"\n4. 
Relevance:\n - Highly Relevant (9-10 points): The response is highly relevant to the question, with information closely aligned with the topic.\n - Generally Relevant (6-8 points): The response is generally relevant but includes some unnecessary information.\n - Partially Relevant (3-5 points): The response has a lot of content that deviates from the topic.\n - Not Relevant (1-2 points): The response is completely irrelevant.\n Example: If the response strays from the topic but still provides some relevant information, it falls under \"Partially Relevant.\"\n\n#### Conversation Context ####\n{}\n#### Responses to be Scored ####\n{}\n\n#### Output Format Requirements ####\n\nOutput with three lines\nSpecific Criteria: <Other potential criteria specific to the query and the context, and the weights of each criteria>.\nAnalysis: <Compare different responses based on given Criteria>.\nScores: <the overall comprehensive score of all resposnes in order, seperate by comma in the boxed, e.g., \\boxed{{x, x}} if there exists 2 responeses>."
}]
single_turn_query = "You are an expert in the uses and creation of spreadsheets in Microsoft Excel. Create a formula that will look for the same entry in cell CB5 from columns D and E and then return all entries from the corresponding rows of these entries from Q5 to T5 and return the last 6 entries in Q5 to T500 by taking the date from cell CK4 and looking up column B"
single_turn_responses = [
"Unfortunately I do not have access to the spreadsheet data you referenced, but I can provide an example formula that attempts to accomplish the described logic:\n\n```\n=IF(OR(D5=CB5,E5=CB5),INDEX(Q5:T5,MATCH(CB5,{D5:E5},0)),INDEX(Q5:T500,MATCH(MAX(IF(B5:B500<=CK4,B5:B500)),B5:B500,0)-5))\n```\n\nBreakdown:\n\n- The IF statement checks if CB5 matches either D5 or E5 using OR. \n- If there is a match, it returns the range Q5:T5 for that row using INDEX and MATCH.\n- If no match, it returns the last 6 rows of Q:T before the date in CK4. It does this by finding the MAX date less than CK4 using MAX and IF, then matching that date using MATCH and returning 6 rows above it using INDEX.\n\nLet me know if you need any clarification or have additional requirements!", # Best Response
"To achieve the desired outcome, you can use the following formula in cell T501:\n```\n=INDEX(OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6),1)\n```\nThis formula uses the `INDEX` and `OFFSET` functions to retrieve the last 6 entries from columns Q5 to T5 starting from the row with the maximum value in either column D or E. Here's how the formula works:\n\n1. `MATCH(MAX(D$5:E$5),D$5:E$5,0)`: This returns the row number of the cell with the maximum value in columns D5 to E5.\n2. `OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6)`: This retrieves a range of 6 cells starting from the row returned by the above formula in columns Q5 to T5.\n3. `INDEX(OFFSET(OFFSET(Q5:T5,MATCH(MAX(D$5:E$5),D$5:E$5,0)),1,6),1)`: This returns the value in the first row of the range retrieved in step 2.\n\nFinally, to link the date in cell CK4 to the first entry in this range, you can use the following formula in cell T500:\n```\n=IF(T501=CK4,T501, \"\")\n```\nThis formula checks whether the value in cell T501 is the same as the date in cell CK4. If they are the same, it returns the value in cell T501. Otherwise, it returns an empty string.",
" To create a formula that does the task you described, you can use the following formula in cell B500:\n\n=JOIN(\" \",IF(ISNUMBER(MATCH(CB5,D:E,0)),D5:E5,\"\"))\n\nHere's how the formula works:\n* The JOIN function is used to combine the rows of data from columns D and E where the value in cell CB5 exists.\n* The IF function is used to check if the value in cell CB5 exists in columns D and E. If it does, then the formula returns the values in columns D and E for those rows. If it doesn't, then the formula returns an empty string.\n* The MATCH function is used to find the row number where the value in cell CB5 exists in columns D and E. If the value is found, then the row number is returned as an integer. If the value is not found, then the function returns #N/A.\n* The D5:E5 and E5:E5 ranges are used to specify the cells to look up in columns D and E.\n* The last 6 entries in Q5 to T500 are selected by using the following formula in cell B500:\n=JOIN(\" \",IF(ISNUMBER(MATCH(CB5,D:E,0)),D5:E5, \"\"))\n\nThis formula will look for the same entry in cell CB5 in columns D and E, and then return all entries from the corresponding rows of these columns from Q5 to T5. It will also return the last 6 entries in Q5 to T500 by taking the date from cell CK4 and looking up column B.\nNote: The above formula assumes that the data is sorted in ascending order in columns D and E based on the date in cell CK4. If the data is not sorted, you may need to adjust the formula accordingly.",
]
def assemble_responses(resp_list):
return "".join([f"[The Begin of Response {i+1}]\n{resp}\n[The End of Response {i+1}]\n\n" for i, resp in enumerate(resp_list)])
messages[0]["content"] = messages[0]["content"].format(single_turn_query, assemble_responses(single_turn_responses))
# Provide the pre-defined principles as the start of the assistant turn; the model continues the judgement (analysis and scores) from them
messages.append({
"role": "assistant",
"content": """Specific Criteria:
1. Accuracy of Formula (Weight: 50%): The formula should correctly implement the logic described in the user's query.
2. Clarity of Explanation (Weight: 30%): The explanation should clearly describe how the formula works and why it is appropriate for the task.
3. Applicability (Weight: 10%): The formula should be applicable to the actual data range and conditions specified by the user.
4. Error Handling (Weight: 5%): The formula should handle potential errors or edge cases, such as no match found in columns D and E.
5. General Criteria Above (Weight: 5%)."""
})
outputs = pipe(messages, max_new_tokens=2048, temperature=1.0, do_sample=True)
judgement = outputs[0]["generated_text"][-1]["content"].strip()
print(judgement)
rewards = extract_last_floats(judgement)
print(rewards)
Reward Generation with Inference-Time Scaling
For reward generation with inference-time scaling, we can implement a simple voting method:
import re
import numpy as np
import torch
import random
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="BBQGOD/DeepSeek-GRM-27B",
model_kwargs={"torch_dtype": torch.bfloat16},
device="cuda", # or "mps"
)
def extract_last_floats(text: str) -> list[float]:
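# Grab the last \boxed{...} or [...] group in the judgement and parse its comma-separated scores as floats.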
pattern = re.compile(
r'(?:\\{1,2}boxed\{|\[)'
r'\s*([^\]\}]+?)\s*'
r'(?:\}|\])'
)
matches = list(pattern.finditer(text))
if not matches:
return []
last_content = matches[-1].group(1)
parts = re.split(r'\s*,\s*', last_content.strip())
floats = []
for p in parts:
try:
floats.append(float(p))
except ValueError:
pass
return floats
messages = [{
"role": "user",
"content": "You are a skilled little expert at scoring responses. You should evaluate given responses based on the given judging criteria.\nGiven the context of the conversation (the last round is the User's query) and multiple responses from the Assistant, you need to refer to the [General Evaluation Criteria] to score the responses. Based on the general evaluation criteria, state potential other specific criteria to the query, the weights of different criteria, and then provide an overall comprehensive score upon them.\nEach score is an integer between 1 and 10, with a higher score indicating that the response meets the relevant criteria more closely. For example, a score of 1 means the response does not meet the criteria at all, a score of 6 means the response meets only some parts, and a score of 10 means the response perfectly meets the evaluation criteria.\nBefore scoring, please analyze step by step. Your scoring needs to be as strict as possible.\n#### Evaluation Criteria ####\n1. Instruction Adherence:\n - Fully Adhered (9-10 points): The response fully complies with all instructions and requirements of the question.\n - Partially Adhered (6-8 points): The response meets most of the instructions but has some omissions or misunderstandings.\n - Basically Adhered (3-5 points): The response meets some instructions, but the main requirements are not fulfilled.\n - Not Adhered (1-2 points): The response does not meet any instructions.\n Example: If the question requires three examples and the response provides only one, it falls under \"Partially Adhered.\"\n2. Usefulness:\n - Highly Useful (9-10 points): The response provides comprehensive and accurate information, fully addressing the issue.\n - Useful but Incomplete (6-8 points): The response provides some useful information, but lacks details or accuracy.\n - Limited Usefulness (3-5 points): The response offers little useful information, with most content being irrelevant or incorrect.\n - Useless or Incorrect (1-2 points): The response is completely irrelevant or incorrect.\n Example: If there are factual errors in the response but the overall direction is correct, it falls under \"Useful but Incomplete.\"\n3. Level of Detail:\n - Very Detailed (9-10 points): The response includes ample details covering all aspects of the issue.\n - Detailed but Slightly Lacking (6-8 points): The response is fairly detailed but misses some important details.\n - Basically Detailed (3-5 points): The response provides some details but is not thorough enough overall.\n - Not Detailed (1-2 points): The response is very brief and lacks necessary details.\n Example: If the response provides only a simple conclusion without an explanation, it falls under \"Not Detailed.\"\n4. 
Relevance:\n - Highly Relevant (9-10 points): The response is highly relevant to the question, with information closely aligned with the topic.\n - Generally Relevant (6-8 points): The response is generally relevant but includes some unnecessary information.\n - Partially Relevant (3-5 points): The response has a lot of content that deviates from the topic.\n - Not Relevant (1-2 points): The response is completely irrelevant.\n Example: If the response strays from the topic but still provides some relevant information, it falls under \"Partially Relevant.\"\n\n#### Conversation Context ####\n{}\n#### Responses to be Scored ####\n{}\n\n#### Output Format Requirements ####\n\nOutput with three lines\nSpecific Criteria: <Other potential criteria specific to the query and the context, and the weights of each criteria>.\nAnalysis: <Compare different responses based on given Criteria>.\nScores: <the overall comprehensive score of all resposnes in order, seperate by comma in the boxed, e.g., \\boxed{{x, x}} if there exists 2 responeses>."
}]
single_turn_query = "..."
single_turn_responses = [
"...",
"...",
"...",
]
def assemble_responses(resp_list):
return "".join(
f"[The Begin of Response {i+1}]\n{resp}\n[The End of Response {i+1}]\n\n"
for i, resp in enumerate(resp_list)
)
k = 5 # number of parallel samples
N = len(single_turn_responses)
total_scores = [0.0] * N
for round_idx in range(k):
indexed = list(enumerate(single_turn_responses))
random.shuffle(indexed)
shuffled_indices, shuffled_resps = zip(*indexed)
tmp_messages = [{
"role": "user",
"content": messages[0]["content"].format(
single_turn_query, assemble_responses(shuffled_resps)
)
}]
outputs = pipe(tmp_messages, max_new_tokens=2048, temperature=1.0, do_sample=True)
judgement = outputs[0]["generated_text"][-1]["content"].strip()
rewards = extract_last_floats(judgement)
print(f"Sample {round_idx + 1} rewards: {rewards}")
for pos_in_shuffled, orig_idx in enumerate(shuffled_indices):
if pos_in_shuffled < len(rewards):
total_scores[orig_idx] += np.clip(rewards[pos_in_shuffled], a_min=1.0, a_max=10.0)
print("Voted scores over", k, "samples:", total_scores)
or a Meta RM guided method:
import torch
from transformers import Gemma2ForSequenceClassification, AutoTokenizer
device = torch.device("cuda")
model_path = "BBQGOD/DeepSeek-GRM-27B-MetaRM"
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer_nobos = AutoTokenizer.from_pretrained(model_path, add_bos_token=False, bos_token="")
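# BOS-free tokenizer for rendering the chat template, so <bos> is not duplicated when the rendered text is tokenized again below.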
model = Gemma2ForSequenceClassification.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model.eval()
def meta_rm_template(_context, _response):
return "Please score the responses.\n\n#### Conversation Context ####\n{}#### Responses to be Scored ####\n{}".format(_context, _response)
context = "User: I have a nice riddle for you…. Can you find the the mistake?\n\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73"
response = [
" Sure, I'd be happy to help you find the mistake in your riddle!\n\nThe mistake is actually quite subtle, but it's there. Can you guess what it is?\n\nHere's a hint: think about the number of letters in each line of the riddle.\n\nDo you see it now?\n\nThe mistake is that the riddle contains 73 lines, but the last line doesn't have 73 letters. It has 72 letters.\n\nGreat job on creating a challenging riddle, and I hope you enjoyed this little puzzle!",
"The mistake in your sequence is that it ends at 73, but it should continue with 74, 75, and so on. If you're looking for a specific type of mistake or a different interpretation, please let me know!",
"Unfortunately I don't see any obvious mistakes in the list of numbers you provided. It seems to just be the integers from 1 to 73 written out in sequential order. Unless there is some subtle pattern or rule that should have been followed that I'm missing, I don't notice any issues with the list. Please let me know if there's something specific I should be looking for as the mistake! Riddles can sometimes be tricky and rely on spotting something that's easy to overlook.", # Best Response
" The mistake in the riddle is the repeated number 23. There are only 26 numbers in the list, but it is listed twice. Therefore, the correct answer is 26.",
"The mistake is that the number 13 is missing."
]
def assemble_responses(resp_list):
return "\n\n".join([f"[The Begin of Response {i+1}]\n{resp}\n[The End of Response {i+1}]" for i, resp in enumerate(resp_list)])
response = assemble_responses(response)
# (judgement, reward)
judgement_list = [
("Specific Criteria: \n1. Riddle Interpretation (Weight: 30%): How accurately the response interprets the riddle's intent and identifies the mistake.\n2. Clarity of Explanation (Weight: 20%): How clearly the response explains the identified mistake.\n3. Engagement (Weight: 10%): How engaging and interactive the response is with the user.\n4. Correctness (Weight: 40%): How correct the identified mistake is based on the riddle's context.\n\nAnalysis:\n- **Response 1**: This response partially adheres to the instructions by identifying a mistake related to the number of letters in each line. However, the riddle's intent seems to be numerical, not textual, making this interpretation incorrect. The explanation is clear but misdirected. The engagement is high with a hint and a follow-up question.\n- **Response 2**: This response fully adheres to the instructions by suggesting a numerical continuation beyond 73, which aligns with the numerical nature of the riddle. The explanation is clear and straightforward. However, it misses the subtle hint of the riddle's intent.\n- **Response 3**: This response fully adheres to the instructions by acknowledging the numerical sequence and asking for specific guidance if there's a subtle pattern missed. It is highly relevant and engages the user to clarify the riddle's intent. The explanation is clear and correct in its approach.\n- **Response 4**: This response partially adheres to the instructions by identifying a repeated number but misinterprets the riddle's context. The explanation is clear but incorrect in identifying the mistake.\n- **Response 5**: This response partially adheres to the instructions by identifying a missing number but does not explain why or how this is relevant to the riddle. The explanation is unclear and incorrect in its approach.\n\nScores: \\boxed{5, 7, 9, 4, 3}", [5.0, 7.0, 9.0, 4.0, 3.0]),
("Specific Criteria: \n1. Riddle Interpretation Accuracy (Weight: 40%): How accurately the response interprets the riddle's intended mistake.\n2. Engagement and Interaction (Weight: 20%): How well the response engages the user and encourages interaction.\n3. Clarity and Explanation (Weight: 20%): How clear and well-explained the response is.\n4. Originality of Solution (Weight: 20%): How original and creative the proposed solution is.\n\nAnalysis:\n- **Response 1**: This response accurately identifies the mistake by focusing on the number of letters in each line, which is a subtle and clever interpretation of the riddle. It engages the user by asking them to guess and provides a clear explanation. The solution is original and fits well with the riddle's format.\n- **Response 2**: This response suggests a continuation of the sequence, which does not address the riddle's intended mistake. It lacks engagement and clarity, and the solution is not original or relevant to the riddle's context.\n- **Response 3**: This response fails to identify any mistake and merely states that the list is sequential. It lacks engagement, clarity, and originality. The solution does not address the riddle's intended mistake.\n- **Response 4**: This response incorrectly identifies a repeated number and a specific count, which is not relevant to the riddle's context. It lacks engagement and clarity, and the solution is not original or accurate.\n- **Response 5**: This response incorrectly identifies a missing number, which is not relevant to the riddle's context. It lacks engagement and clarity, and the solution is not original or accurate.\n\nScores: \\boxed{9, 3, 2, 2, 2}", [9.0, 3.0, 2.0, 2.0, 2.0])
]
inputs = [
[
{
"role": "user",
"content": meta_rm_template(context, response)
},
{
"role": "assistant",
"content": judgement
}
] for judgement, _ in judgement_list
]
texts = [tokenizer_nobos.apply_chat_template(input, add_generation_prompt=False, tokenize=False) for input in inputs]
inputs = [tokenizer(text, return_tensors="pt") for text in texts]
outputs = []
with torch.no_grad():
for input in inputs:
input.to(device)
output = model(**input)
outputs.append(output.logits[0].item())
for i, text in enumerate(texts):
print(f"> 输入: {text}")
print(f" Logits: {outputs[i]}")
print()
scores_and_rewards = list(zip(outputs, [r for _, r in judgement_list]))
scores_and_rewards.sort(key=lambda x: x[0], reverse=True)
top_half = scores_and_rewards[:len(scores_and_rewards) // 2]
top_reward_lists = [rewards for _, rewards in top_half]
print("Top-half reward lists:", top_reward_lists)
final_rewards = [sum(vals) for vals in zip(*top_reward_lists)]
print("Final reward list:", final_rewards)
Limitations & Potential Bias
Despite the significant performance improvements, residual toxicity or domain skew can still surface, especially outside English and Chinese corpora or on strictly verifiable tasks where scalar RMs and rule-based verifiers still excel. Manual spot-checks remain advisable in high-risk scenarios.
Citation
@misc{liu2025inferencetimescalinggeneralistreward,
title={Inference-Time Scaling for Generalist Reward Modeling},
author={Zijun Liu and Peiyi Wang and Runxin Xu and Shirong Ma and Chong Ruan and Peng Li and Yang Liu and Yu Wu},
year={2025},
eprint={2504.02495},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.02495},
}
License: Gemma