---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---
# Model Card for Slam (scaled)
This is a Speech Language Model trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).
## Model Details
### Model Description
This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light, and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
### Model Sources
- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [Soon!]
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
## Uses
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
### Out-of-Scope Use
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.
## How to Get Started with the Model
We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).
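Below is a minimal, non-official sketch of loading the checkpoint with the standard 🤗transformers causal-LM API and sampling a continuation over speech units. The repository id and the unit-token format are assumptions for illustration only; extracting units from a waveform and vocoding generated units back to audio are handled by SlamKit and are not shown here.

```python
# Minimal sketch (not the official SlamKit API): load the checkpoint as a plain
# causal LM and sample a continuation over discrete speech-unit tokens.
# The repo id and the unit-token strings below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/slam_scaled"  # assumed Hub id for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A short prompt of de-duplicated speech units (hypothetical token format).
prompt = "<unit_12><unit_407><unit_93>"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample new unit tokens; a unit vocoder (see SlamKit) turns them back into audio.
output = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=200)
print(tokenizer.decode(output[0]))
```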
## Training Details
We highly encourage users to read the full [paper]() for complete training details; a brief overview is provided below.
### Training Data
This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
### Training Procedure
This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
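For reference, the sketch below shows the standard DPO objective in PyTorch. This is the generic formulation, not code taken from SlamKit: the policy is rewarded for assigning relatively higher likelihood to the chosen continuation than to the rejected one, compared with a frozen reference model.

```python
# Generic DPO objective (standard formulation, not taken from SlamKit).
# Each input is the per-sequence sum of token log-probabilities, shape [batch].
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: log-prob gain of the policy over the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and the rejected continuation.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```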
#### Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
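For concreteness, here is a sketch of the de-duplication step only, assuming it follows the usual GSLM-style convention of collapsing consecutive repeats of the same k-means unit; feature extraction and k-means quantization themselves are done with the tools linked above.

```python
# Collapse runs of identical unit ids (assumed GSLM-style de-duplication).
from itertools import groupby

def deduplicate(units):
    """e.g. [5, 5, 5, 17, 17, 4] -> [5, 17, 4]"""
    return [u for u, _ in groupby(units)]

print(deduplicate([5, 5, 5, 17, 17, 4]))  # [5, 17, 4]
```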
## Evaluation
The paper provides the full results; we give some key results here and also refer readers to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.
| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |
### Compute Infrastructure
This model was trained as part of "*Slamming*: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.
#### Hardware
This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
easy and efficient training of Speech Language Models.
## Citation
**BibTeX:**
Soon!