Add pipeline tag, tags and license to metadata
#1 opened by nielsr (HF Staff)

README.md CHANGED
````diff
@@ -1,14 +1,17 @@
 ---
 library_name: transformers
-tags:
+tags:
+- long-context
+- forgetting-attention
+- deltanet
+pipeline_tag: text-generation
+license: mit
 ---
 
 # DeltaNet Model Checkpoint for the Forgetting Transformer Paper
 
 The final checkpoint for the 760M-parameter DeltaNet model in the main experiment of the ICLR 2025 paper [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130).
 
-
-
 ## Model Details
 
 ### Model Description
@@ -85,15 +88,11 @@ out = model(input_ids=input_ids)
 assert out.logits.size() == (batch_size, seq_len, tokenizer.vocab_size)
 ```
 
-
-
 ## Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 This is a small model trained on a small number of tokens from LongCrawl64, provided for reproducibility and research purposes. Also, as a long-context dataset for research purposes, LongCrawl64 is not designed for optimal downstream task performance (it also has a strange tokenization process, see [here](https://github.com/zhixuan-lin/forgetting-transformer/blob/main/src/forgetting_transformer/tokenizer.py)). Therefore, this model is only suitable for research purposes (e.g., inspecting attention maps). Also, if you want to compare this model with other models trained in another setting with another dataset, **you should definitely train it from scratch on your own dataset under your own setting for the comparison.**
 
-
-
 ## Training Details
 
 ### Training Data
@@ -106,8 +105,6 @@ This model is trained on roughly 48B tokens on LongCrawl64, with a training cont
 
 Please see [our paper](https://arxiv.org/abs/2503.02130) for details. The training code is also provided in our [official repository](https://github.com/zhixuan-lin/forgetting-transformer).
 
-
-
 **BibTeX:**
 
 ```
````
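The unchanged lines around the second hunk come from the card's usage snippet (the `out = model(input_ids=input_ids)` call and the logits-shape assert). For orientation, here is a minimal sketch of what that snippet does; the repo id below is a placeholder assumption, and the exact loading steps (whether `trust_remote_code=True` suffices or the `forgetting_transformer` package must be installed) should be taken from the model card itself, not from this sketch.

```python
# Minimal sketch: load the checkpoint and reproduce the shape check from the card.
# The repo id is assumed for illustration; the custom DeltaNet model code lives in
# the official forgetting-transformer repository, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "zhixuan-lin/deltanet-760m-longcrawl64"  # placeholder, check the model card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

batch_size, seq_len = 2, 128
input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len))
with torch.no_grad():
    out = model(input_ids=input_ids)

# Mirrors the assertion shown in the README's usage snippet.
assert out.logits.size() == (batch_size, seq_len, tokenizer.vocab_size)
```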