---
license: apache-2.0
datasets:
- Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
base_model:
- Qwen/Qwen2.5-7B
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

A custom pseudo "fill in the middle" model, trained to handle varying "corruption rates" (randomized UTF-8 character substitution). Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.

Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling found at: https://github.com/kalomaze/quest-tools

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/aTAHg9aIzKIzul4sYCTRw.png)

Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), along with the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244). The full hyperparameters are also visible in the attached configuration `.py` file.

## Prompt Formatting

Trained without ChatML templating. This model uses a pattern of:
- Raw "corrupted" text at the beginning, with UTF-8 substitutions applied to parts of the input.
- The "objective" as a Claude-style XML tag with newline separators.
- The beginning of an "original" tag, left open for the model to complete.

```
from typing import Dict

def _format_prompt(self, example: Dict) -> str:
    return (
        f"{example['corrupted']}\n\n"
        "<objective>\n"
        "gently repair the content\n"
        "</objective>\n\n"
        "<original>\n"
    )
```

A runnable generation sketch that follows this pattern is included at the end of this card.

The primary utility of this model is to synthesize rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus). This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.

## Acknowledgements

Trained on 8xH200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open source experimentation. Big McThankies.
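
## Example Usage

A minimal generation sketch using Hugging Face `transformers` and the raw (non-ChatML) prompt pattern above. The model path, the example corrupted string, the generation settings, and the assumption that completions end with a closing `</original>` tag are illustrative, not details confirmed elsewhere in this card.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; point this at the downloaded checkpoint for this model.
model_id = "path/to/this-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An arbitrary "corrupted" input with a few UTF-8 character substitutions.
corrupted = "Thё quіck brоwn fоx jumps оvеr thе lаzy dоg."

# Same pattern as _format_prompt above: corrupted text, objective tag, open <original> tag.
prompt = (
    f"{corrupted}\n\n"
    "<objective>\n"
    "gently repair the content\n"
    "</objective>\n\n"
    "<original>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Assumes the model closes the block with </original>; trim anything after it.
repaired = completion.split("</original>")[0].strip()
print(repaired)
```

Greedy decoding is used here for determinism; sampling may be preferable when the goal is to produce varied lower-quality "rejected" completions for preference data.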