---
library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: Llama-speechlmm-1.0-l
results: []
---
## Model information
SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative models, available in 4 sizes: S (2B), M (4B), L (9B) and XL (71B). The models support text, audio and video as input and produce only text as output, and are optimized for various X-to-text generation tasks, namely:
- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding
- Visual Speech Recognition
- Text Summarization
**Model Developer:** Meetween consortium
**Supported Languages:** English, French, Italian, German, and Spanish are officially supported (each for a subset of the tasks listed above). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder were trained on a broader set of languages, so the model may also perform well on languages beyond these five.
**Model Release Date:** Feb 28, 2025
**License:** see [LICENSE](LICENSE)
### Model Architecture
SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder ([SeamlessM4T v2](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/)) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder ([Auto-AVSR](https://ieeexplore.ieee.org/document/10096889)) and a video adapter.
| Model | Params | Input modalities | Output modalities | Context Length |
|:---------------- |:----------- |:------------------------------------------ |:----------------- |:-------------- |
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
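At a high level, each non-text input is processed by its modality-specific encoder, projected into the LLM embedding space by the corresponding adapter, and combined with the text embeddings before being fed to the backbone. Below is a minimal, purely illustrative sketch of this composition; all class and method names are hypothetical placeholders, not the actual `speechlmm` API.

```python
# Illustrative sketch of the SpeechLMM 1.0 forward composition.
# All names are hypothetical placeholders, NOT the actual `speechlmm` API.
import torch
import torch.nn as nn

class SpeechLMMSketch(nn.Module):
    def __init__(self, audio_encoder, audio_adapter,
                 video_encoder, video_adapter, llm):
        super().__init__()
        self.audio_encoder = audio_encoder  # pre-trained SeamlessM4T v2
        self.audio_adapter = audio_adapter  # window-level Q-Former
        self.video_encoder = video_encoder  # pre-trained Auto-AVSR
        self.video_adapter = video_adapter  # MLP
        self.llm = llm                      # Llama 3.X backbone

    def forward(self, text_embeds, audio=None, video=None):
        # Project each non-text modality into the LLM embedding space,
        # then concatenate everything along the sequence dimension.
        parts = [text_embeds]
        if audio is not None:
            parts.append(self.audio_adapter(self.audio_encoder(audio)))
        if video is not None:
            parts.append(self.video_adapter(self.video_encoder(video)))
        inputs_embeds = torch.cat(parts, dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```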
#### Audio and video encoders
For all four sizes of SpeechLMM 1.0, the audio encoder is **SeamlessM4T v2 Large** (`facebook/seamless-m4t-v2-large`) and the video encoder is **Auto-AVSR** (`vsr_trlrs3vox2_base`).
#### Audio and video adapters
For all four sizes of SpeechLMM 1.0, the audio and video adapters are:
| Modality | Architecture | Number of layers | Compression factor |
| :------- | :------------------------------------- | :--------------- | :----------------- |
| Audio    | Window-level Q-Former (4 queries)       | 4                | 4                  |
| Video | MLP | 4 | 1 |
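As a rough illustration of the audio adapter above: a window-level Q-Former splits the encoder output into fixed-size windows and lets a small set of learned queries cross-attend to each window, so the output length shrinks by window_size / num_queries. The sketch below assumes that relation (4 queries over windows of 16 frames, giving the 4x compression listed above); it is an assumption for illustration, not the actual `speechlmm` implementation.

```python
# Minimal sketch of a window-level Q-Former audio adapter. Assumes
# compression_factor = window_size / num_queries, so 4 queries over
# windows of 16 frames yield 4x temporal compression (an assumption
# for illustration, not the actual `speechlmm` implementation).
import torch
import torch.nn as nn

class WindowQFormerSketch(nn.Module):
    def __init__(self, d_model=1024, num_queries=4, window_size=16,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.window_size = window_size
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, frames):
        # frames: (batch, time, d_model) from the audio encoder
        b, t, d = frames.shape
        # Split into non-overlapping windows, dropping any ragged tail.
        usable = t - t % self.window_size
        windows = frames[:, :usable].reshape(-1, self.window_size, d)
        # Each window gets its own copy of the learned queries.
        out = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        for layer in self.layers:
            # Queries cross-attend to the frames of their own window.
            out = layer(out, windows)
        # (batch, num_windows * num_queries, d_model): 4x fewer tokens.
        return out.reshape(b, -1, d)
```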
#### LLM backbone
| Model | Backbone |
|:----------------------- |:---------------------- |
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |
## How to use
Currently, this model can only be used via our [`speechlmm`](https://github.com/meetween/speechlmm) codebase. Refer to the instructions there for more details.
**Important:** before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions in the README of the repository linked above. Note that by doing so, you agree to their respective license terms.
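For example, the checkpoint can be fetched programmatically with `huggingface_hub`; the repo id below is assumed to match this model card:

```python
# Hedged example: downloading the checkpoint with huggingface_hub.
# The repo id is assumed to match this model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="meetween/Llama-speechlmm-1.0-l")
print(f"Checkpoint downloaded to {local_dir}")
# Inference itself must go through the `speechlmm` codebase linked above.
```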
## Training Data
### Monolingual
| TASK | Task name | Dataset | Language | License |
| -------- | ---------------------------- | ------------------ | -------- | ------------------------------------------ |
| **ASR** | Automatic Speech Recognition | **LibriHeavy** | en | CC-BY-4.0 |
| | | **LibriTTS** | en | CC-BY-4.0 |
| | | **AMI** | en | CC-BY-4.0 |
| | | **ICSI** | en | CC-BY-4.0 |
| **LIPREAD** | Visual Speech Recognition | **LRS2-BBC** | en | Custom |
| **SSUM** | Speech Summarization | **AMI** | en | CC-BY-4.0 |
| | | **ICSI** | en | CC-BY-4.0 |
| **SQA** | Spoken Question Answering | **Spoken SQuAD** | en | CC-BY-SA-4.0 |
| **SLU** | Spoken Language Understanding| **SLURP** | en | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |
### Multilingual
| TASK | Task name | Dataset | Language | License |
| ---------------- | ----------------------------- | ------------------------------------ | ------------------------------------------- | ------------------------------------------ |
| **ASR** | Automatic Speech Recognition | **CoVoST2** | en, fr, it, de, es | CC0 |
| | | **CommonVoice** | en, fr, it, de, es | Apache-2.0 |
| **ST** | Speech-to-text Translation | **CoVoST2** | en → de, {fr, it, de, es} → en | CC0 |
| | | **EuroParl-ST** | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| **MT** | Machine Translation | **EuroParl-ST** | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| **TextInstruct** | Text Instruction Following | **Everything_Instruct_Multilingual** | en, fr, it, de, es, ru, zh, ko, ur, la, ar, hi, ja, nl, pt | Apache-2.0 |
| **SLU** | Spoken Language Understanding | **Speech-Massive** | fr, de | CC-BY-NC-SA-4.0 |
## Evaluation Results
The following results specifically refer to the L model.
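The ASR and lipreading scores are word error rate (WER, lower is better) and the translation scores are BLEU (higher is better). As a minimal sketch, such scores are commonly computed with tooling like `jiwer` and `sacrebleu` (assumed here for illustration; not necessarily the evaluation stack used for these results):

```python
# Sketch of how WER and BLEU are commonly computed. jiwer and sacrebleu
# are assumed tooling, not necessarily what produced the scores below.
import jiwer
import sacrebleu

refs = ["the cat sat on the mat"]
hyps = ["the cat sat on a mat"]

wer = jiwer.wer(refs, hyps) * 100           # WER as a percentage; lower is better
bleu = sacrebleu.corpus_bleu(hyps, [refs])  # corpus-level BLEU; higher is better
print(f"WER: {wer:.2f}  BLEU: {bleu.score:.1f}")
```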
### ASR Metrics
| Dataset | Language | WER ⬇|
|:-----------------|:-----------|------:|
| **ACL 60-60** | en | 24.52 |
| **CoVoST 2**     | it         | 28.00 |
| **CoVoST 2** | es | 32.36 |
| **CoVoST 2** | fr | 16.51 |
| **CoVoST 2** | de | 37.94 |
### SLU Metrics
| Dataset | Language | Intent accuracy ⬆|
|:-------------------|:-----------|------------------:|
| **Speech-MASSIVE** | en | 24.52 |
| **Speech-MASSIVE** | it         |             28.00 |
| **SLURP** | es | 32.36 |
### SQA Metrics
| Dataset | Language | Exact Match ⬆| F1 Score ⬆|
|:-----------------|:-----------|-------------:|-----------:|
| **Spoken SQuAD** | en | 65.45 | 77.35 |
### SSUM Metrics
| Dataset | Language | R-1 F1 ⬆| R-2 F1 ⬆| R-L F1 ⬆|
|:----------|:-----------|----------:|-----------:|-----------:|
| **ICSI** | en | 26.6 | 3.5 | 23.9 |
### ST Metrics
| Dataset | Source Language | Target Language | BLEU ⬆|
|:--------------|:------------------|:------------------|--------:|
| **CoVoST 2** | en | de | 31.1 |
| **CoVoST 2** | de | en | 36.2 |
| **CoVoST 2** | es | en | 41.1 |
| **CoVoST 2** | fr | en | 39.0 |
| **CoVoST 2** | it | en | 32.5 |
| **ACL 60-60** | en | fr | 29.1 |
| **ACL 60-60** | en | de | 27.6 |
### LIPREAD Metrics
| Dataset | Language | WER ⬇|
|:-----------------|:-----------|------:|
| **LRS2-BBC** | en | 45.44 |
### MT Metrics
| Dataset | Source Language | Target Language | BLEU ⬆|
|:------------|:------------------|:------------------|--------:|
| **FLORES** | en | de | 29.4 |
| **FLORES** | en | es | 22.3 |
| **FLORES** | en | it | 20.1 |
| **FLORES** | en | fr | 31.9 |
| **ACL 60-60** | en                | fr                |    35.5 |
| **ACL 60-60** | en                | de                |    32.8 |
### TSUM Metrics
| Dataset | Language | R-1 F1 ⬆| R-2 F1 ⬆| R-L F1 ⬆|
|:-----------------|:-----------|----------:|-----------:|-----------:|
| **ICSI** | en | 26.2 | 3.2 | 23.3 |
| **AutoMin 2023** | en | 28.3 | 2.9 | 25.9 |
## Framework versions
- Transformers 4.45.0
- PyTorch 2.3.1+cu124.post2
- Datasets 3.2.0
- Tokenizers 0.20.0