---
library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: Llama-speechlmm-1.0-l
  results: []
---

## Model information

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 sizes: S (2B), M (4B), L (9B) and XL (71B). The models accept text, audio and video as input and produce only text as output. They are optimized for various X-to-text generation tasks, namely:

- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding
- Visual Speech Recognition
- Text Summarization

**Model Developer:** Meetween consortium

**Supported Languages:** English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder were trained on a broader set of languages than these 5, so the model may also perform well on other languages.

**Model Release Date:** Feb 28, 2025

**License:** see [LICENSE](LICENSE)

### Model Architecture

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder ([SeamlessM4T v2](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/)) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder ([Auto-AVSR](https://ieeexplore.ieee.org/document/10096889)) and a video adapter.

<!-- TODO: add the image of the model architecture here -->

| Model            | Params      | Input modalities                           | Output modalities | Context Length |
|:---------------- |:----------- |:------------------------------------------ |:----------------- |:-------------- |
| SpeechLMM 1.0 S  | 2B (2.17B)  | Multilingual text and audio, English video | Multilingual Text | 128k           |
| SpeechLMM 1.0 M  | 4B (4.15B)  | Multilingual text and audio, English video | Multilingual Text | 128k           |
| SpeechLMM 1.0 L  | 9B (8.98B)  | Multilingual text and audio, English video | Multilingual Text | 128k           |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k           |

#### Audio and video encoders

For all 4 sizes of SpeechLMM 1.0, the audio encoder is **SeamlessM4T v2 Large** (`facebook/seamless-m4t-v2-large`) and the video encoder is **Auto-AVSR** (`vsr_trlrs3vox2_base`).

#### Audio and video adapters

For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:

| Modality | Architecture                           | Number of layers | Compression factor |
| :------- | :------------------------------------- | :--------------- | :----------------- |
| Audio    | Window-level Q-former <br> (4 queries) |                4 |                  4 |
| Video    | MLP                                    |                4 |                  1 |
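
As a rough illustration of what the compression factor means for sequence length, the sketch below estimates how many embeddings each adapter passes to the LLM for a given number of encoder frames. It assumes the compression factor is the ratio of encoder frames to adapter outputs and that a trailing partial window still produces output (ceiling division). The function name and the frame counts are hypothetical and are not taken from the `speechlmm` codebase.

```python
import math

# Compression factors copied from the adapter table above.
COMPRESSION_FACTOR = {"audio": 4, "video": 1}


def adapter_output_length(num_encoder_frames: int, modality: str) -> int:
    """Estimate how many embeddings the adapter hands to the LLM.

    Assumption: a trailing partial window is rounded up rather than dropped.
    """
    if num_encoder_frames < 0:
        raise ValueError("num_encoder_frames must be non-negative")
    return math.ceil(num_encoder_frames / COMPRESSION_FACTOR[modality])


# Hypothetical example: 1,000 encoder frames per modality.
print(adapter_output_length(1000, "audio"))  # 250
print(adapter_output_length(1000, "video"))  # 1000
```

Under these assumptions, the audio adapter shortens the sequence fourfold, while the video adapter leaves its length unchanged.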


#### LLM backbone

| Model                   | Backbone               |
|:----------------------- |:---------------------- |
| SpeechLMM 1.0 S         | Llama 3.2 1B Instruct  |
| SpeechLMM 1.0 M         | Llama 3.2 3B Instruct  |
| SpeechLMM 1.0 L         | Llama 3.1 8B Instruct  |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

## How to use

Currently, this model can only be used via our [`speechlmm`](https://github.com/meetween/speechlmm) codebase. Refer to the instructions there for more details.

Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions provided in the README of the above repo. Please note that by doing so, you agree to their respective license terms.

## Training Data

### Monolingual

| TASK     | Task name                    | Dataset            | Language | License                                    |
| -------- | ---------------------------- | ------------------ | -------- | ------------------------------------------ |
| **ASR**  | Automatic Speech Recognition | **LibriHeavy**     | en       | CC-BY-4.0                                  |
|          |                              | **LibriTTS**       | en       | CC-BY-4.0                                  |
|          |                              | **AMI**            | en       | CC-BY-4.0                                  |
|          |                              | **ICSI**           | en       | CC-BY-4.0                                  |
| **LIPREAD**  | Visual Speech Recognition    | **LRS2-BBC**       | en       | Custom                                     |
| **SSUM** | Speech Summarization         | **AMI**            | en       | CC-BY-4.0                                  |
|          |                              | **ICSI**           | en       | CC-BY-4.0                                  |
| **SQA**  | Spoken Question Answering    | **Spoken SQuAD**   | en       | CC-BY-SA-4.0                               |
| **SLU**  | Spoken Language Understanding| **SLURP**          | en       | CC-BY-4.0 (text) <br> CC-BY-NC-4.0 (audio) |

### Multilingual

| TASK             | Task name                     | Dataset                              | Language                                    | License                                    |
| ---------------- | ----------------------------- | ------------------------------------ | ------------------------------------------- | ------------------------------------------ |
| **ASR**          | Automatic Speech Recognition  | **CoVoST 2**                         | en, fr, it, de, es                          | CC0                                        |
|                  |                               | **CommonVoice**                      | en, fr, it, de, es                          | Apache-2.0                                 |
| **ST**           | Speech-to-text Translation    | **CoVoST 2**                         | en → de, {fr, it, de, es} → en              | CC0                                        |
|                  |                               | **EuroParl-ST**                      | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0                               |
| **MT**           | Machine Translation           | **EuroParl-ST**                      | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0                               |
| **TextInstruct** | Text Instruction Following    | **Everything_Instruct_Multilingual** | en, fr, it, de, es, ru, zh, ko, ur, la, ar,<br>hi, ja, nl, pt                                          | Apache-2.0                                 |
| **SLU**          | Spoken Language Understanding | **Speech-MASSIVE**                   | fr, de                                      | CC-BY-NC-SA-4.0                            |

## Evaluation Results
The following results specifically refer to the L model.

### ASR Metrics
| Dataset          | Language   | WER ⬇|
|:-----------------|:-----------|------:|
| **ACL 60-60**    | en         | 24.52 |
| **CoVoST 2**     | it         | 28    |
| **CoVoST 2**     | es         | 32.36 |
| **CoVoST 2**     | fr         | 16.51 |
| **CoVoST 2**     | de         | 37.94 |
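
For readers unfamiliar with the metric, WER is the word-level edit distance between the model output and the reference transcript, divided by the number of reference words and reported here as a percentage. The sketch below is a minimal illustration, assuming whitespace tokenization and no text normalization. The scores in the table may have been computed with casing or punctuation normalization that this card does not specify, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```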

### SLU Metrics
| Dataset            | Language   | Intent accuracy ⬆|
|:-------------------|:-----------|------------------:|
| **Speech-MASSIVE** | en         |             24.52 |
| **Speech-MASSIVE** | it         |             28    |
| **SLURP**          | es         |             32.36 |

### SQA Metrics
| Dataset          | Language   | Exact Match ⬆| F1 Score ⬆|
|:-----------------|:-----------|-------------:|-----------:|
| **Spoken SQuAD** | en         |        65.45 |      77.35 |
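
Exact Match and F1 for question answering are commonly computed in the SQuAD style: both strings are normalized, then compared either as whole strings or as bags of tokens. The sketch below follows that convention as an assumption; this card does not state which normalization or scoring script produced the numbers above, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower."))   # 1.0
print(round(token_f1("in Paris France", "Paris"), 2))     # 0.5
```

Dataset-level scores are typically the mean of these per-example values, multiplied by 100.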

### SSUM Metrics
| Dataset   | Language   |   R-1 F1 ⬆|   R-2 F1 ⬆|   R-L F1 ⬆| 
|:----------|:-----------|----------:|-----------:|-----------:|
| **ICSI**  | en         |      26.6 |        3.5 |       23.9 |

### ST Metrics
| Dataset       | Source Language   | Target Language   |   BLEU ⬆|
|:--------------|:------------------|:------------------|--------:|
| **CoVoST 2**  | en                | de                |    31.1 |
| **CoVoST 2**  | de                | en                |    36.2 |
| **CoVoST 2**  | es                | en                |    41.1 |
| **CoVoST 2**  | fr                | en                |    39.0 |
| **CoVoST 2**  | it                | en                |    32.5 |
| **ACL 60-60** | en                | fr                |    29.1 |
| **ACL 60-60** | en                | de                |    27.6 |

### LIPREAD Metrics
| Dataset          | Language   | WER ⬇|
|:-----------------|:-----------|------:|
| **LRS2-BBC**     | en         | 45.44 |

### MT Metrics
| Dataset       | Source Language   | Target Language   |   BLEU ⬆|
|:--------------|:------------------|:------------------|--------:|
| **FLORES**    | en                | de                |    29.4 |
| **FLORES**    | en                | es                |    22.3 |
| **FLORES**    | en                | it                |    20.1 |
| **FLORES**    | en                | fr                |    31.9 |
| **ACL 60-60** | en                | fr                |    35.5 |
| **ACL 60-60** | en                | de                |    32.8 |

### TSUM Metrics
| Dataset          | Language   |   R-1 F1 ⬆|   R-2 F1 ⬆|   R-L F1 ⬆| 
|:-----------------|:-----------|----------:|-----------:|-----------:|
| **ICSI**         | en         |      26.2 |        3.2 |       23.3 |
| **AutoMin 2023** | en         |      28.3 |        2.9 |       25.9 |

## Framework versions

- Transformers 4.45.0
- PyTorch 2.3.1+cu124.post2
- Datasets 3.2.0
- Tokenizers 0.20.0