Add pipeline tag, tags and license to metadata
#1 opened by nielsr (HF Staff)

README.md CHANGED
````diff
@@ -1,14 +1,17 @@
 ---
 library_name: transformers
-tags:
+tags:
+- long-context
+- forgetting-attention
+- deltanet
+pipeline_tag: text-generation
+license: mit
 ---
 
 # DeltaNet Model Checkpoint for the Forgetting Transformer Paper
 
 The final checkpoint for the 760M-parameter DeltaNet model in the main experiment of the ICLR 2025 paper [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130).
 
-
-
 ## Model Details
 
 ### Model Description
@@ -85,15 +88,11 @@ out = model(input_ids=input_ids)
 assert out.logits.size() == (batch_size, seq_len, tokenizer.vocab_size)
 ```
 
-
-
 ## Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 This is a small model trained on a small number of tokens from LongCrawl64, provided for reproducibility and research purposes. Also, as a long-context dataset for research purposes, LongCrawl64 is not designed for optimal downstream task performance (it also has a strange tokenization process, see [here](https://github.com/zhixuan-lin/forgetting-transformer/blob/main/src/forgetting_transformer/tokenizer.py)). Therefore, this model is only suitable for research purposes (e.g., inspecting attention maps). Also, if you want to compare this model with other models trained in another setting with another dataset, **you should definitely train it from scratch on your own dataset under your own setting for the comparison.**
 
-
-
 ## Training Details
 
 ### Training Data
@@ -106,8 +105,6 @@ This model is trained on roughly 48B tokens on LongCrawl64, with a training cont
 
 Please see [our paper](https://arxiv.org/abs/2503.02130) for details. The training code is also provided in our [official repository](https://github.com/zhixuan-lin/forgetting-transformer).
 
-
-
 **BibTeX:**
 
 ```
````
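The unchanged lines around the second hunk come from the card's usage snippet (the `out = model(input_ids=input_ids)` call and the logits-shape assert). For orientation, here is a minimal sketch of what that snippet does; the repo id below is a placeholder assumption, and the exact loading steps (whether `trust_remote_code=True` suffices or the `forgetting_transformer` package must be installed) should be taken from the model card itself, not from this sketch.

```python
# Minimal sketch: load the checkpoint and reproduce the shape check from the card.
# The repo id is assumed for illustration; the custom DeltaNet model code lives in
# the official forgetting-transformer repository, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "zhixuan-lin/deltanet-760m-longcrawl64"  # placeholder, check the model card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

batch_size, seq_len = 2, 128
input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len))
with torch.no_grad():
    out = model(input_ids=input_ids)

# Mirrors the assertion shown in the README's usage snippet.
assert out.logits.size() == (batch_size, seq_len, tokenizer.vocab_size)
```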