shumingma committed
Commit dc4f215 · 1 Parent(s): f709283

update readme

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -45,6 +45,8 @@ Several versions of the model weights are available on Hugging Face:
   * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
   * **Parameters:** ~2 Billion
   * **Training Tokens:** 4 Trillion
+ * **Context Length:** Maximum sequence length of **4096 tokens**.
+ * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
   * **Training Stages:**
   1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
   2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
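
The first bullet in the diff stresses that the model was trained from scratch under its quantization scheme rather than quantized after the fact. As a minimal illustrative sketch (assuming a BitNet-style ternary "absmean" rule, which the diff itself does not spell out), quantization-aware training keeps full-precision master weights and quantizes them on the fly in the forward pass:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Scale by the mean absolute weight, then round every entry to the
    # nearest value in {-1, 0, +1}. Hypothetical sketch; the README does
    # not spell out the exact rounding rule used for this model.
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)

def ste_forward(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: the forward pass sees the ternary
    # weights, while gradients flow to the full-precision master weights.
    return w + (absmean_ternary(w) - w).detach()
```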
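
The new **Context Length** bullet caps sequences at 4096 tokens, so longer prompts must be truncated at tokenization time. A minimal sketch using the Hugging Face `transformers` tokenizer API; the checkpoint id here is a placeholder, not the model's actual repository:

```python
from transformers import AutoTokenizer

# "your-org/your-checkpoint" is a placeholder repository id.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-checkpoint")

long_document = "some very long input text " * 5000

# Truncate the input to the model's 4096-token maximum sequence length.
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
```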
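
The pre-training bullet mentions a two-stage learning rate and weight decay schedule without giving details. One way such a schedule can be wired up in PyTorch is sketched below; every boundary and value here is an assumption for illustration, not the schedule the README refers to:

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

TOTAL_STEPS = 100_000   # illustrative
STAGE1_STEPS = 60_000   # illustrative stage boundary

def two_stage(step: int) -> float:
    # Stage 1: hold the peak learning rate.
    # Stage 2: linear decay toward zero.
    if step < STAGE1_STEPS:
        return 1.0
    return max(0.0, 1.0 - (step - STAGE1_STEPS) / (TOTAL_STEPS - STAGE1_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, two_stage)

def set_weight_decay(opt: torch.optim.Optimizer, value: float) -> None:
    # The weight decay can likewise be switched at the stage boundary.
    for group in opt.param_groups:
        group["weight_decay"] = value
```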
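
The SFT bullet's "sum loss aggregation" means summing the per-token cross-entropy over a batch rather than taking the mean, so longer responses contribute proportionally more to each update. A minimal sketch; the tensor shapes and the -100 ignore index are conventional assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss_sum(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), with -100
    # marking prompt/padding tokens to be ignored (a common convention).
    # reduction="sum" adds per-token losses instead of averaging them.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",
    )
```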