---
license: apache-2.0
datasets:
- Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
base_model:
- Qwen/Qwen2.5-7B
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---
Custom pseudo "fill in the middle" trained model, designed to handle varying "corruption rates" (randomized UTF-8 character substitution).
Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.
Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling seen at:
https://github.com/kalomaze/quest-tools
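
The "corruption" here means randomly substituting characters in otherwise clean text. As a rough illustration only (the actual scheme lives in the quest-tools repo; the character range and per-example rate handling below are assumptions), a minimal sketch might look like:
```
import random

def corrupt(text: str, rate: float, seed=None) -> str:
    """Illustrative only: swap each character for a random codepoint with probability `rate`."""
    rng = random.Random(seed)
    return "".join(
        chr(rng.randint(0x21, 0x2FFF)) if rng.random() < rate else ch
        for ch in text
    )

# "varying corruption rates" -> e.g. sample a different rate for each example
print(corrupt("The quick brown fox jumps over the lazy dog.", rate=0.25, seed=0))
```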

Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), along with the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244).
The full hyperparameters are also visible in the attached configuration .py file.
## Prompt Formatting
Trained without ChatML templating. This model uses a pattern of:
- Raw "corrupted" text at the beginning with UTF8 substitution for parts of the input.
- The "objective" as a Claude-style XML tag with newline separators.
- The beginning of an "original" tag.
```
def _format_prompt(self, example: Dict) -> str:
    return (
        f"{example['corrupted']}\n\n"
        "<objective>\n"
        "gently repair the content\n"
        "</objective>\n\n"
        "<original>\n"
    )
```
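For reference, a minimal inference sketch with Hugging Face Transformers following the prompt pattern above; the placeholder repo id, sampling parameters, and the assumption that generation should be cut at a closing `</original>` tag are mine, not taken from the training code:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # hypothetical placeholder; point at the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

corrupted = "Th‡ qu∆ck brown fox jumps ov∑r the lazy dog."  # corrupted input text
prompt = (
    f"{corrupted}\n\n"
    "<objective>\n"
    "gently repair the content\n"
    "</objective>\n\n"
    "<original>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
repaired = completion.split("</original>")[0]  # assume the model closes the tag when done
print(repaired)
```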
The primary utility of this model is as a means to synthesize rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus).
This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
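A sketch of that synthesis loop, under the same assumptions as the snippets above (`corrupt_fn` and `repair_fn` stand in for the hypothetical corruption and generation helpers, and the chosen/rejected field names are just one common preference-data convention):
```
def synthesize_preference_example(clean_text: str, corrupt_fn, repair_fn, rate: float = 0.2) -> dict:
    """Pair clean pretraining text (chosen) with this model's subtly flawed
    reconstruction of its corrupted version (rejected)."""
    corrupted = corrupt_fn(clean_text, rate)
    return {
        "chosen": clean_text,
        "rejected": repair_fn(corrupted),
    }
```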
## Acknowledgements
Trained on 8xH200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open source experimentation. Big McThankies.