---
license: apache-2.0
datasets:
- Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
base_model:
- Qwen/Qwen2.5-7B
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---
Custom pseudo "fill in the middle" trained model, designed to handle varying "corruption rates" (randomized UTF-8 character substitution).
Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.
Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling seen at:
https://github.com/kalomaze/quest-tools
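
The "corruption" here means randomly substituting characters in otherwise clean text. As a rough illustration only (the actual scheme lives in the quest-tools repo; the character range and per-example rate handling below are assumptions), a minimal sketch might look like:
```
import random

def corrupt(text: str, rate: float, seed=None) -> str:
    """Illustrative only: swap each character for a random codepoint with probability `rate`."""
    rng = random.Random(seed)
    return "".join(
        chr(rng.randint(0x21, 0x2FFF)) if rng.random() < rate else ch
        for ch in text
    )

# "varying corruption rates" -> e.g. sample a different rate for each example
print(corrupt("The quick brown fox jumps over the lazy dog.", rate=0.25, seed=0))
```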

Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), along with the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244).
The full hyperparameters are also visible in the attached configuration .py file.
## Prompt Formatting
Trained without ChatML templating. This model uses a pattern of:
- Raw "corrupted" text at the beginning with UTF8 substitution for parts of the input.
- The "objective" as a Claude-style XML tag with newline separators.
- The beginning of an "original" tag.
```
def _format_prompt(self, example: Dict) -> str:
    return (
        f"{example['corrupted']}\n\n"
        "<objective>\n"
        "gently repair the content\n"
        "</objective>\n\n"
        "<original>\n"
    )
```
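For reference, a minimal inference sketch with Hugging Face Transformers following the prompt pattern above; the placeholder repo id, sampling parameters, and the assumption that generation should be cut at a closing `</original>` tag are mine, not taken from the training code:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # hypothetical placeholder; point at the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

corrupted = "Th‡ qu∆ck brown fox jumps ov∑r the lazy dog."  # corrupted input text
prompt = (
    f"{corrupted}\n\n"
    "<objective>\n"
    "gently repair the content\n"
    "</objective>\n\n"
    "<original>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
repaired = completion.split("</original>")[0]  # assume the model closes the tag when done
print(repaired)
```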
The primary utility of this model is as a means to synthesize rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus).
This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
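A sketch of that synthesis loop, under the same assumptions as the snippets above (`corrupt_fn` and `repair_fn` stand in for the hypothetical corruption and generation helpers, and the chosen/rejected field names are just one common preference-data convention):
```
def synthesize_preference_example(clean_text: str, corrupt_fn, repair_fn, rate: float = 0.2) -> dict:
    """Pair clean pretraining text (chosen) with this model's subtly flawed
    reconstruction of its corrupted version (rejected)."""
    corrupted = corrupt_fn(clean_text, rate)
    return {
        "chosen": clean_text,
        "rejected": repair_fn(corrupted),
    }
```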
## Acknowledgements
Trained on 8xH200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open source experimentation. Big McThankies.