shumingma committed
Commit dc4f215 · 1 Parent(s): f709283

update readme

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -45,6 +45,8 @@ Several versions of the model weights are available on Hugging Face:
   * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
   * **Parameters:** ~2 Billion
   * **Training Tokens:** 4 Trillion
+ * **Context Length:** Maximum sequence length of **4096 tokens**.
+ * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
   * **Training Stages:**
   1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
   2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
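
The first bullet in the diff stresses that the model was trained from scratch under its quantization scheme rather than quantized after the fact. As a minimal illustrative sketch (assuming a BitNet-style ternary "absmean" rule, which the diff itself does not spell out), quantization-aware training keeps full-precision master weights and quantizes them on the fly in the forward pass:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Scale by the mean absolute weight, then round every entry to the
    # nearest value in {-1, 0, +1}. Hypothetical sketch; the README does
    # not spell out the exact rounding rule used for this model.
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)

def ste_forward(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: the forward pass sees the ternary
    # weights, while gradients flow to the full-precision master weights.
    return w + (absmean_ternary(w) - w).detach()
```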
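
The new **Context Length** bullet caps sequences at 4096 tokens, so longer prompts must be truncated at tokenization time. A minimal sketch using the Hugging Face `transformers` tokenizer API; the checkpoint id here is a placeholder, not the model's actual repository:

```python
from transformers import AutoTokenizer

# "your-org/your-checkpoint" is a placeholder repository id.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-checkpoint")

long_document = "some very long input text " * 5000

# Truncate the input to the model's 4096-token maximum sequence length.
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
```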
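
The pre-training bullet mentions a two-stage learning rate and weight decay schedule without giving details. One way such a schedule can be wired up in PyTorch is sketched below; every boundary and value here is an assumption for illustration, not the schedule the README refers to:

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

TOTAL_STEPS = 100_000   # illustrative
STAGE1_STEPS = 60_000   # illustrative stage boundary

def two_stage(step: int) -> float:
    # Stage 1: hold the peak learning rate.
    # Stage 2: linear decay toward zero.
    if step < STAGE1_STEPS:
        return 1.0
    return max(0.0, 1.0 - (step - STAGE1_STEPS) / (TOTAL_STEPS - STAGE1_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, two_stage)

def set_weight_decay(opt: torch.optim.Optimizer, value: float) -> None:
    # The weight decay can likewise be switched at the stage boundary.
    for group in opt.param_groups:
        group["weight_decay"] = value
```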
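
The SFT bullet's "sum loss aggregation" means summing the per-token cross-entropy over a batch rather than taking the mean, so longer responses contribute proportionally more to each update. A minimal sketch; the tensor shapes and the -100 ignore index are conventional assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss_sum(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), with -100
    # marking prompt/padding tokens to be ignored (a common convention).
    # reduction="sum" adds per-token losses instead of averaging them.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",
    )
```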