Add pipeline tag, tags and license to metadata

#1 opened by nielsr (HF Staff)

Files changed (1)
  1. README.md +6 -9
README.md CHANGED
@@ -1,14 +1,17 @@
 ---
 library_name: transformers
-tags: []
+tags:
+- long-context
+- forgetting-attention
+- deltanet
+pipeline_tag: text-generation
+license: mit
 ---
 
 # DeltaNet Model Checkpoint for the Forgetting Transformer Paper
 
 The final checkpoint for the 760M-parameter DeltaNet model in the main experiment of the ICLR 2025 paper [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130).
 
-
-
 ## Model Details
 
 ### Model Description
@@ -85,15 +88,11 @@ out = model(input_ids=input_ids)
 assert out.logits.size() == (batch_size, seq_len, tokenizer.vocab_size)
 ```
 
-
-
 ## Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 This is a small model trained on a small number of tokens from LongCrawl64, provided for reproducibility and research purposes. Also, as a long-context dataset for research purposes, LongCrawl64 is not designed for optimal downstream task performance (it also has a strange tokenization process, see [here](https://github.com/zhixuan-lin/forgetting-transformer/blob/main/src/forgetting_transformer/tokenizer.py)). Therefore, this model is only suitable for research purposes (e.g., inspecting attention maps). Also, if you want to compare this model with other models trained in another setting with another dataset, **you should definitely train it from scratch on your own dataset under your own setting for the comparison.**
 
-
-
 ## Training Details
 
 ### Training Data
@@ -106,8 +105,6 @@ This model is trained on roughly 48B tokens on LongCrawl64, with a training cont
 
 Please see [our paper](https://arxiv.org/abs/2503.02130) for details. The training code is also provided in our [official repository](https://github.com/zhixuan-lin/forgetting-transformer).
 
-
-
 **BibTeX:**
 
 ```
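
For anyone who wants to sanity-check the new front matter locally before merging, here is a minimal sketch (not part of the diff) that parses the YAML header this PR adds and asserts the resulting metadata fields. It assumes `huggingface_hub` is installed; it only reads the header text shown above, so no repo id or network access is needed.

```python
# Minimal sketch: verify the updated README front matter parses into the
# metadata fields the Hub uses for filtering, the widget, and the license badge.
# Assumes the `huggingface_hub` package is available.
from huggingface_hub import ModelCard

README_HEADER = """\
---
library_name: transformers
tags:
- long-context
- forgetting-attention
- deltanet
pipeline_tag: text-generation
license: mit
---

# DeltaNet Model Checkpoint for the Forgetting Transformer Paper
"""

# ModelCard parses the YAML front matter into a ModelCardData object.
card = ModelCard(README_HEADER)

assert card.data.library_name == "transformers"
assert card.data.pipeline_tag == "text-generation"
assert card.data.license == "mit"
assert set(card.data.tags) == {"long-context", "forgetting-attention", "deltanet"}

# Print the normalized metadata block for a quick visual check.
print(card.data.to_yaml())
```

If these fields parse as expected, the Hub will list the checkpoint under the text-generation pipeline, expose the three tags as search filters, and show the MIT license badge on the model page.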