update readme
README.md
CHANGED
@@ -45,6 +45,8 @@ Several versions of the model weights are available on Hugging Face:
 * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
 * **Parameters:** ~2 Billion
 * **Training Tokens:** 4 Trillion
+* **Context Length:** Maximum sequence length of **4096 tokens**.
+    * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
 * **Training Stages:**
     1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
     2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
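The new **Context Length** bullet caps sequences at 4096 tokens. A minimal sketch of enforcing that limit at tokenization time, assuming a Hugging Face `transformers`-style tokenizer (the checkpoint name below is a placeholder, not the actual repository ID):

```python
# Sketch only: enforcing the 4096-token maximum sequence length at tokenization.
# "your-org/your-checkpoint" is a placeholder, not the actual repository ID.
from transformers import AutoTokenizer

MAX_SEQ_LEN = 4096  # maximum sequence length stated in the model card

tokenizer = AutoTokenizer.from_pretrained("your-org/your-checkpoint")

def encode_prompt(prompt: str):
    """Tokenize a prompt, truncating anything beyond the supported context window."""
    return tokenizer(
        prompt,
        truncation=True,
        max_length=MAX_SEQ_LEN,
        return_tensors="pt",
    )
```

Inputs longer than 4096 tokens are simply truncated here; use beyond that length is what the added *Recommendation* about intermediate long-sequence adaptation is aimed at.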
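The SFT stage mentions sum loss aggregation. A minimal sketch of what that aggregation could look like in PyTorch, assuming labels are already shifted for next-token prediction and prompt/padding positions are masked with -100 (both assumptions, not details from this README):

```python
# Sketch only: "sum" loss aggregation for SFT, as opposed to a per-token mean.
# Shapes and the ignore_index value are illustrative assumptions.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value assumed to mask prompt/padding tokens

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy summed over all supervised tokens in the batch.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), already shifted.
    With reduction="sum", every supervised token contributes to the batch loss
    directly instead of being averaged per token or per example.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
        reduction="sum",
    )
```

Relative to the usual per-token mean, summing changes the effective gradient scale with the number of supervised tokens, which is presumably part of the "specific hyperparameter tuning" the SFT bullet refers to.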