Improve language tag #1
by lbourdois - opened

README.md CHANGED
---
license: apache-2.0
datasets:
- Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
base_model:
- Qwen/Qwen2.5-7B
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---
Custom pseudo "fill in the middle" trained model, designed to handle varying "corruption rates" (randomized UTF8 character substitution).
Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.

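For illustration, a corruption pass in this spirit might look like the sketch below; the function name, rate, and substitution range are assumptions, not the exact recipe behind the linked dataset.

```
import random

def corrupt(text: str, rate: float = 0.15, seed=None) -> str:
    # Illustrative only: swap a fraction of characters for random
    # UTF-8 code points (Latin supplement/extended here); the real
    # corruption recipe lives in the linked dataset and tooling.
    rng = random.Random(seed)
    return "".join(
        chr(rng.randint(0x00A1, 0x024F)) if rng.random() < rate else ch
        for ch in text
    )
```

For example, `corrupt("the quick brown fox", rate=0.3)` might yield something like `thê qùick browñ fox`. The two reward functions themselves ship with the attached RL code linked below; a format-adherence check in that spirit (the scoring values here are made up) could look like:

```
import re

def xml_format_reward(completion: str) -> float:
    # Illustrative stand-in: full credit when the completion ends
    # cleanly at a closing </original> tag, partial credit when
    # extra text trails after it, nothing when the tag is missing.
    m = re.search(r"</original>", completion)
    if m is None:
        return 0.0
    return 1.0 if completion[m.end():].strip() == "" else 0.5
```
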
Designed to be used with the (jank, hacky, personalized) PyQT GUI tooling seen at:
https://github.com/kalomaze/quest-tools

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64e3f19a5dcc07f0de6f106a/DbypKsq4fKpDEHzAM6jnj.png)

Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), as well as the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244).
Full hyperparameters are observable in the configuration .py as well.
## Prompt Formatting
Trained without ChatML templating. This model uses a pattern of:
- Raw "corrupted" text at the beginning with UTF8 substitution for parts of the input.
- The "objective" as a Claude-style XML tag with newline separators.
- The beginning of an "original" tag.

```
from typing import Dict  # needed for the annotation below

def _format_prompt(self, example: Dict) -> str:
    # Prompt layout: corrupted text, a blank line, the <objective>
    # block, then an opened <original> tag for the model to complete.
    return (
        f"{example['corrupted']}\n\n"
        "<objective>\n"
        "gently repair the <original> content\n"
        "</objective>\n\n"
        "<original>\n"
    )
```

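Rendered, a prompt therefore looks like this (corrupted input abbreviated here for illustration):

```
thê qùick browñ fox jumps ovér the lazy dog

<objective>
gently repair the <original> content
</objective>

<original>
```

The completion is the repaired text, which the GRPO rewards push to stay within the XML structure (presumably terminating with a closing `</original>` tag).
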
The primary utility of this model is as a means to synthesize rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus).
This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
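
A minimal sketch of that synthesis loop, assuming the illustrative `corrupt()` helper from above and a placeholder model path (both hypothetical, not the exact pipeline used here):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-model"  # placeholder, not a real repo id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

chosen = "A clean passage from the pretraining corpus."  # preferred side
prompt = (
    f"{corrupt(chosen, rate=0.3)}\n\n"
    "<objective>\n"
    "gently repair the <original> content\n"
    "</objective>\n\n"
    "<original>\n"
)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
# The subtly-off reconstruction becomes the rejected side of the pair.
rejected = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
```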
## Acknowledgements
Trained on 8xH200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open source experimentation. Big McThankies.