Safetensors · qwen2
lbourdois committed
Commit e4e54cc · verified · 1 Parent(s): 5e6ba94

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve how the model is indexed. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1): README.md +56 -42
README.md CHANGED
@@ -1,42 +1,56 @@
- ---
- license: apache-2.0
- datasets:
- - Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
- base_model:
- - Qwen/Qwen2.5-7B
- ---
- Custom pseudo-"fill in the middle" model, designed to handle varying "corruption rates" (randomized UTF-8 character substitution).
- Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.
-
- Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling seen at:
- https://github.com/kalomaze/quest-tools
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/aTAHg9aIzKIzul4sYCTRw.png)
-
- Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), along with the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244).
- Full hyperparameters are visible in the configuration .py as well.
-
- ## Prompt Formatting
- Trained without ChatML templating. This model uses a pattern of:
-
- - Raw "corrupted" text at the beginning, with UTF-8 substitution applied to parts of the input.
- - The "objective" as a Claude-style XML tag with newline separators.
- - The beginning of an "original" tag.
-
- ```
- from typing import Dict
-
- # Method of the prompt-building class in the linked RL code.
- def _format_prompt(self, example: Dict) -> str:
-     return (
-         f"{example['corrupted']}\n\n"
-         "<objective>\n"
-         "gently repair the <original> content\n"
-         "</objective>\n\n"
-         "<original>\n"
-     )
- ```
-
- The primary utility of this model is as a means of synthesizing rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus).
- This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
-
- ## Acknowledgements
-
- Trained on 8x H200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open-source experimentation. Big McThankies.
+ ---
+ license: apache-2.0
+ datasets:
+ - Quest-AI/quest-corruption-truncated4grpo-6k-dataset-v1
+ base_model:
+ - Qwen/Qwen2.5-7B
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
+ Custom pseudo-"fill in the middle" model, designed to handle varying "corruption rates" (randomized UTF-8 character substitution).
+ Two custom GRPO reward functions were used to improve the pre-existing SFT-trained model so that it more reliably attends to the XML styling.
+
+ Designed to be used with the (jank, hacky, personalized) PyQt GUI tooling seen at:
+ https://github.com/kalomaze/quest-tools
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/aTAHg9aIzKIzul4sYCTRw.png)
+
+ Wandb logs for this run can be found [here](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi?nw=nwuserkalomaze), along with the [attached RL code](https://wandb.ai/kalomaze/verifiers-examples/runs/7l9b6pvi/files/run_files_20250224_112244).
+ Full hyperparameters are visible in the configuration .py as well.
+
+ ## Prompt Formatting
+ Trained without ChatML templating. This model uses a pattern of:
+
+ - Raw "corrupted" text at the beginning, with UTF-8 substitution applied to parts of the input.
+ - The "objective" as a Claude-style XML tag with newline separators.
+ - The beginning of an "original" tag.
+
+ ```
+ from typing import Dict
+
+ # Method of the prompt-building class in the linked RL code.
+ def _format_prompt(self, example: Dict) -> str:
+     return (
+         f"{example['corrupted']}\n\n"
+         "<objective>\n"
+         "gently repair the <original> content\n"
+         "</objective>\n\n"
+         "<original>\n"
+     )
+ ```
+
+ The primary utility of this model is as a means of synthesizing rejected / lower-quality preference data from pre-existing SFT data (i.e., the general pretraining corpus).
+ This is useful for teaching a reward model **generalized preferences** from lower-quality, subtly incoherent, base-model-esque completions, which are trivial to produce compared to human annotations.
+
+ ## Acknowledgements
+
+ Trained on 8x H200s provided free of charge by [Deepshard](https://github.com/deepshard) for research & open-source experimentation. Big McThankies.
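
For anyone who wants to try the prompt pattern documented in the README above, here is a minimal inference sketch. It is not from this repo: the repo id is a placeholder, the `transformers` loading boilerplate is generic, and the completion is simply truncated at the closing `</original>` tag.

```
# Hedged sketch (not from this repo): repair a corrupted string using the
# raw, non-ChatML prompt pattern from the README.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-quest-repair-model"  # placeholder, substitute the real repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def repair(corrupted: str, max_new_tokens: int = 512) -> str:
    # Mirrors _format_prompt from the README.
    prompt = (
        f"{corrupted}\n\n"
        "<objective>\n"
        "gently repair the <original> content\n"
        "</objective>\n\n"
        "<original>\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Keep only the text up to the closing tag, if the model emitted one.
    return completion.split("</original>")[0].strip()
```

The corruption side can be sketched the same way. The dataset's exact corruption scheme is not specified on this page, so the random UTF-8 substitution below is an illustrative stand-in, as is the (chosen, rejected) pair construction built on the hypothetical `repair` helper above.

```
import random

def corrupt(text: str, rate: float = 0.15, seed: int = 0) -> str:
    # Illustrative stand-in: substitute a random subset of characters with
    # arbitrary BMP code points at the given rate.
    rng = random.Random(seed)
    return "".join(
        chr(rng.randint(0x21, 0x2FFF)) if rng.random() < rate else ch
        for ch in text
    )

# Build a preference pair: the clean text is "chosen"; the model's repair of
# a corrupted copy is "rejected" (subtly incoherent, cheap to produce).
original = "The quick brown fox jumps over the lazy dog."
pair = {"chosen": original, "rejected": repair(corrupt(original))}
```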