	---
	base_model:
	- Qwen/Qwen2.5-0.5B
	datasets:
	- openslr/librispeech_asr
	- slprl/SpokenSwag
	- slprl/sTinyStories
	library_name: transformers
	license: mit
	pipeline_tag: audio-to-audio
	language:
	- zho
	- eng
	- fra
	- spa
	- por
	- deu
	- ita
	- rus
	- jpn
	- kor
	- vie
	- tha
	- ara
	---

	# Slamming: Training a Speech Language Model on One GPU in a Day

	The model was presented in the paper [Slamming: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814).

	# Paper abstract

	We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

	# Model Card for Slam (scaled)
	This is a Speech Language Model (SLM) trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


	## Model Details

	### Model Description
	This Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focuses on efficient training.
	It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
	the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

	The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and a synthetic dataset
	[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was subsequently fine-tuned with DPO on
	[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

	- Developed by: [SLP-RL](https://huggingface.co/slprl)
	- Model type: SpeechLM
	- License: MIT
	- Finetuned from model: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

	### Model Sources

	- Repository: [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
	- Paper: [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
	- Demo: [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

	## Uses
	This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
	[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

	### Out-of-Scope Use
	This model was trained on curated speech datasets that contain mainly audiobooks and stories, so its outputs should not be treated as factual in any way.



	## How to Get Started with the Model
	We refer users to the official [SlamKit repository](https://github.com/slp-rl/slamkit) for full usage instructions.


	## Training Details
	We strongly encourage users to read the [paper](https://arxiv.org/abs/2502.15814) for full training details. A brief overview is provided below.


	### Training Data
	This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
	[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
	[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
	dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

	### Training Procedure
	This model was first trained with next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
	Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
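
To make the DPO stage more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not taken from SlamKit: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of the chosen and rejected speech-token continuations under the trained model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are summed sequence log-probabilities.
    beta=0.1 is an assumed value, not the one used in the paper.
    """
    # How much more the policy likes each continuation than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is lower when the policy prefers the chosen continuation.
good = dpo_loss(-5.0, -20.0, -10.0, -10.0)
bad = dpo_loss(-20.0, -5.0, -10.0, -10.0)
print(good < bad)  # True
```

When the policy and the reference agree exactly, the margin is zero and the loss is log 2, which is the starting point of training if the policy is initialised from the reference.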

	#### Preprocessing
	Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
	official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
	We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
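
As an illustration of the de-duplication step, the sketch below collapses runs of identical consecutive units into a single unit, which is the usual meaning of de-duplication for HuBERT-style unit sequences. It is an assumption that SlamKit does exactly this; the helper name `deduplicate_units` and the example unit IDs are hypothetical.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of identical consecutive units into one unit."""
    return [unit for unit, _ in groupby(units)]


# Eight frames at 25 Hz cover 0.32 s of audio (hypothetical unit IDs).
frames = [17, 17, 17, 402, 402, 9, 17, 17]
print(deduplicate_units(frames))  # [17, 402, 9, 17]
```

Only adjacent repeats are merged, so a unit that reappears later in the sequence (17 above) is kept.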


	## Evaluation
	Full results are reported in the paper. Selected results are shown below, and the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) provides audio samples.

	| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
	|---|---|---|---|---|---|---|---|---|
	| Speech only pre-training | | | | | | | | |
	| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
	| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
	| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
	| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
	| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
	| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
	| Scaled Optimal | — | 823M | 82B | 61.3 | 56.7 | 78.0 | — | — |
	| Moshi | ?×H100 | 7B | ? | 58.9 | 58.7 | 81.8 | — | — |
	| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
	| With text / preference optimization | | | | | | | | |
	| Scaling Interleaving | — | 9B | ~1T | — | 62.4 | 82.9 | — | — |
	| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
	| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
	| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
	| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | 62.3 | 61.1 | 86.8 | — | — |
	| Ours (_Slam_) | | | | | | | | |
	| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
	| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
	| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |



	### Compute Infrastructure
	This model was trained as part of ["Slamming: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

	#### Hardware
	This model was trained using only 2 Nvidia A100 GPUs for 48 hours.
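
For a rough sense of scale, the budget above can be expressed in GPU-hours and set against the single-GPU, 24-hour recipe named in the paper title. This is only back-of-the-envelope arithmetic: the helper `gpu_hours` is hypothetical, and GPU-hours on different GPU types (A100 vs. A5000) are not directly comparable.

```python
def gpu_hours(num_gpus, hours):
    """Total GPU-hours for a run, ignoring differences between GPU types."""
    return num_gpus * hours


scaled_budget = gpu_hours(2, 48)  # this model: 2 GPUs for 48 hours
single_budget = gpu_hours(1, 24)  # "one GPU in a day" recipe
print(scaled_budget, single_budget, scaled_budget / single_budget)  # 96 24 4.0
```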

	#### Software
	The model was trained using the [SlamKit](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗 Transformers and extends it to support
	easy and efficient training of Speech Language Models.

	## Citation

	BibTeX:
	```
	@misc{maimon2025slamming,
	title={Slamming: Training a Speech Language Model on One GPU in a Day},
	author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
	year={2025},
	eprint={2502.15814},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2502.15814},
	}
	```