---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---
# Model Card for Model ID
This is a Speech Language Model trained to generate speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).
## Model Details
### Model Description
This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
### Model Sources
- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [Soon!]
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
## Uses
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
### Out-of-Scope Use
This model was trained on curated speech datasets consisting mainly of audiobooks and stories, so its outputs should not be treated as factual in any way.
## How to Get Started with the Model
We refer users to the official repository for full usage explanations: [github](https://github.com/slp-rl/slamkit).
## Training Details
We highly encourage users to read the full [paper]() for complete training details; a brief overview is provided below.
### Training Data
This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
### Training Procedure
This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
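
As a rough illustration of the DPO stage mentioned above, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not taken from _SlamKit_: the function name `dpo_loss`, its argument names, and the value `beta=0.1` are assumptions chosen for this example, and the actual training recipe may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each continuation under
    the trained policy and the frozen reference model. `beta` is an
    assumed value, not the one used for this model.
    """
    # How much more the policy likes each continuation than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a margin of 0 and a loss of log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen continuation's log-probability lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss falls as the policy assigns relatively more probability to the preferred continuation than the reference model does, which is the signal preference data such as SpokenSwag provides.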
#### Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
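
The de-duplication step described above can be sketched as collapsing runs of identical consecutive units into one. This is a minimal illustration and not the _SlamKit_ implementation; the helper name `deduplicate_units` and the example unit ids are hypothetical.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical frame-level unit ids (one per frame, before de-duplication).
frames = [17, 17, 17, 402, 402, 9, 17, 17]
print(deduplicate_units(frames))  # [17, 402, 9, 17]
```

Only adjacent repeats are merged, so a unit that reappears later in the sequence (17 in the example) is kept.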
## Evaluation
The paper provides full results; a selection is given here. See also the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.
| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | - | - |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | - | 75.4 | - | - |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | - | - | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | - | - |
| Scaled Optimal | - | 823M | 82B | **61.3** | 56.7 | 78.0 | - | - |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | - | - |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | - | - |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | - | 9B | ~1T | - | **62.4** | 82.9 | - | - |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | - | - |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | - | - |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | - | - |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | - | - |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |
### Compute Infrastructure
This model was trained as part of "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
#### Hardware
This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
easy and efficient training of Speech Language Models.
## Citation
**BibTeX:**
Soon!