---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---
# Model Card for Slam (scaled)
This is a Speech Language Model trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).
## Model Details
### Model Description
This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light, and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
### Model Sources
- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [Soon!]
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
## Uses
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
### Out-of-Scope Use
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.
## How to Get Started with the Model
We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).
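Below is a minimal, non-official sketch of loading the checkpoint with the standard 🤗transformers causal-LM API and sampling a continuation over speech units. The repository id and the unit-token format are assumptions for illustration only; extracting units from a waveform and vocoding generated units back to audio are handled by SlamKit and are not shown here.

```python
# Minimal sketch (not the official SlamKit API): load the checkpoint as a plain
# causal LM and sample a continuation over discrete speech-unit tokens.
# The repo id and the unit-token strings below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/slam_scaled"  # assumed Hub id for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A short prompt of de-duplicated speech units (hypothetical token format).
prompt = "<unit_12><unit_407><unit_93>"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample new unit tokens; a unit vocoder (see SlamKit) turns them back into audio.
output = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=200)
print(tokenizer.decode(output[0]))
```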
## Training Details
We highly encourage users to read the full [paper]() for complete training details; a brief overview is provided below.
### Training Data
This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
### Training Procedure
This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
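For reference, the sketch below shows the standard DPO objective in PyTorch. This is the generic formulation, not code taken from SlamKit: the policy is rewarded for assigning relatively higher likelihood to the chosen continuation than to the rejected one, compared with a frozen reference model.

```python
# Generic DPO objective (standard formulation, not taken from SlamKit).
# Each input is the per-sequence sum of token log-probabilities, shape [batch].
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: log-prob gain of the policy over the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and the rejected continuation.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```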
#### Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
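For concreteness, here is a sketch of the de-duplication step only, assuming it follows the usual GSLM-style convention of collapsing consecutive repeats of the same k-means unit; feature extraction and k-means quantization themselves are done with the tools linked above.

```python
# Collapse runs of identical unit ids (assumed GSLM-style de-duplication).
from itertools import groupby

def deduplicate(units):
    """e.g. [5, 5, 5, 17, 17, 4] -> [5, 17, 4]"""
    return [u for u, _ in groupby(units)]

print(deduplicate([5, 5, 5, 17, 17, 4]))  # [5, 17, 4]
```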
## Evaluation
The paper provides the full results; we give some key results here and also refer readers to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.
| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |
### Compute Infrastructure
This model was trained as part of "*Slamming*: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.
#### Hardware
This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
easy and efficient training of Speech Language Models.
## Citation
**BibTeX:**
Soon!