---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-3.3-8b-instruct
library_name: transformers
---
# Granite-speech-3.3-8b
**Model Summary:**
Granite-speech-3.3-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Unlike integrated models that handle speech and language in a single pass, granite-speech-3.3-8b uses a two-pass design: an initial call transcribes the audio into text, and a second, explicitly initiated call is needed to process that transcript with the underlying Granite language model.
The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.3-8b was built by modality-aligning granite-3.3-8b-instruct (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) to speech on publicly available open-source corpora containing audio inputs and text targets.
We are currently investigating an issue with greedy decoding (`num_beams=1`); the model performs reliably with beam sizes > 1, which we recommend for all use cases.
Additionally, the model may occasionally hallucinate on very short audio inputs (<0.1s). These issues are under active investigation, and we will update guidance as fixes become available.
**Evaluations:**
We evaluated granite-speech-3.3-8b alongside other speech-language models (SLMs) with fewer than 8B parameters, as well as dedicated ASR and AST systems, on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks, while also including AST for En-X translation.



**Release Date:** April 15, 2025
**License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
**Supported Languages:**
English
**Intended Use:**
The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, it is well suited for English speech-to-text and speech translation from English into major European languages such as French, Spanish, Italian, German, and Portuguese, as well as Japanese and Mandarin. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.
## Generation:
The Granite Speech model is supported natively in `transformers` from the `main` branch. Below is a simple example of how to use the `granite-speech-3.3-8b` model.
### Usage with `transformers`
First, make sure to install the latest version of `transformers` from source, along with `torchaudio`, `peft`, and `soundfile`:
```shell
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
```
Then run the code:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template
audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    },
]
text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device,  # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
```
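The example above asserts mono, 16 kHz input. If your audio comes in another format, a minimal preprocessing sketch like the one below (using standard `torchaudio` calls; the helper name `to_mono_16k` is purely illustrative) can bring it into the expected shape before handing it to the processor:

```python
import torch
import torchaudio
import torchaudio.functional as F

def to_mono_16k(path: str) -> torch.Tensor:
    """Load an audio file and convert it to mono, 16 kHz (illustrative helper, not part of the model API)."""
    wav, sr = torchaudio.load(path, normalize=True)
    if wav.shape[0] > 1:  # downmix stereo/multichannel to mono
        wav = wav.mean(dim=0, keepdim=True)
    if sr != 16000:       # resample to the 16 kHz rate the processor expects
        wav = F.resample(wav, orig_freq=sr, new_freq=16000)
    return wav
```

Because of the two-pass design described in the model summary, any further processing of the transcript happens in a separate, text-only call to a Granite language model. Below is a minimal sketch of such a second pass, loading granite-3.3-8b-instruct directly (in line with the recommendation to route text-only input to the base Granite LLM) and reusing `device` and `output_text` from the example above; the follow-up prompt is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

llm_name = "ibm-granite/granite-3.3-8b-instruct"
llm_tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name).to(device)

# Second pass: plain text only, no <|audio|> token in the prompt.
followup_chat = [
    {"role": "user", "content": f"Summarize the following transcript in one sentence:\n{output_text[0]}"},
]
followup_prompt = llm_tokenizer.apply_chat_template(
    followup_chat, tokenize=False, add_generation_prompt=True
)
followup_inputs = llm_tokenizer(followup_prompt, return_tensors="pt").to(device)
followup_outputs = llm.generate(**followup_inputs, max_new_tokens=200)
answer_tokens = followup_outputs[0, followup_inputs["input_ids"].shape[-1]:]
print(llm_tokenizer.decode(answer_tokens, skip_special_tokens=True))
```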
### Usage with `vLLM`
First, make sure to install the latest version of vLLM:
```shell
pip install vllm --upgrade
```
* Code for offline mode:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
    lora_request=[LoRARequest("speech", 1, model_id)]
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

### 2. Example without Audio [do NOT use the lora]
question = "What is the capital of Brazil?"
prompt = get_prompt(
    question=question,
    has_audio=False,
)

outputs = model.generate(
    {"prompt": prompt},
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=12,
    ),
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")
```
* Code for online mode:
```python
"""
Launch the vLLM server with the following command:
vllm serve ibm-granite/granite-speech-3.3-8b \
--api-key token-abc123 \
--max-model-len 2048 \
--enable-lora \
--lora-modules speech=ibm-granite/granite-speech-3.3-8b \
--max-lora-rank 64
"""
import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
base_model_name = "ibm-granite/granite-speech-3.3-8b"
lora_model_name = "speech"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url
# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
"""Encode an audio retrieved from a remote url to base64 format."""
with requests.get(audio_url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode('utf-8')
return result
audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)
### 1. Example with Audio
# NOTE: we pass the name of the lora model (`speech`) here because we have audio.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
{
"type": "audio_url",
"audio_url": {
# Any format supported by librosa is supported
"url": f"data:audio/ogg;base64,{audio_base64}"
},
},
],
}],
temperature=0.2,
max_tokens=64,
model=lora_model_name,
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
### 2. Example without Audio
# NOTE: we pass the name of the base model here because we do not have audio.
question = "What is the capital of Brazil?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
],
}],
temperature=0.2,
max_tokens=12,
model=base_model_name,
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
```
**Model Architecture:**
The architecture of granite-speech-3.3-8b consists of the following components:
(1) Speech encoder: 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets on the subset containing only ASR corpora (see configuration below). In addition, our CTC encoder uses block attention over 4-second audio blocks and self-conditioned CTC from the middle layer.

| Configuration parameter | Value |
|-----------------|----------------------|
| Input dimension | 160 (80 logmels x 2) |
| Nb. of layers | 10 |
| Hidden dimension | 1024 |
| Nb. of attention heads | 8 |
| Attention head size | 128 |
| Convolution kernel size | 15 |
| Output dimension | 42 |
(2) Speech projector and temporal downsampler (speech-text modality adapter): we use a 2-layer window query transformer (q-former) operating on blocks of 15 1024-dimensional acoustic embeddings coming out of the last conformer block of the speech encoder, which get downsampled by a factor of 5 using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), resulting in a 10 Hz acoustic embedding rate for the LLM; a small worked example of this arithmetic follows the list below. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the corpora mentioned under **Training Data**.
(3) Large language model: granite-3.3-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct).
(4) LoRA adapters: rank=64, applied to the query and value projection matrices.
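To make the temporal downsampling in (2) concrete, here is a small back-of-the-envelope sketch. It assumes a 10 ms log-Mel hop (implied by the 80 logmels x 2 stacking and the stated 10 Hz output rate); the helper function is purely illustrative:

```python
def llm_audio_embeddings(duration_s: float) -> int:
    """Approximate number of acoustic embeddings the LLM receives for a clip of the given length."""
    logmel_frames = int(duration_s * 100)  # assumed 10 ms hop -> 100 log-Mel frames per second
    encoder_frames = logmel_frames // 2    # 2x downsampling from stacking 2 frames (80 logmels x 2)
    blocks = -(-encoder_frames // 15)      # q-former blocks of 15 encoder embeddings (ceiling division)
    return blocks * 3                      # 3 trainable queries per block -> roughly 5x downsampling

# 30 s of audio -> about 300 embeddings, i.e. the stated 10 Hz acoustic embedding rate for the LLM.
print(llm_audio_embeddings(30.0))
```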
**Training Data:**
Overall, our training data comprises two key sources: (1) publicly available datasets and (2) synthetic data created from publicly available datasets, specifically targeting the speech translation task. A detailed description of the training datasets can be found in the table below:

| Name | Task | Nb. hours | Source |
|-----------|--------------|----------------|--------------|
| CommonVoice-17 English | ASR | 2600 | https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 |
| MLS English | ASR | 44000 | https://huggingface.co/datasets/facebook/multilingual_librispeech |
| Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
| VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
| AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
| YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
| Switchboard English | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
| CallHome English | ASR | 18 | https://catalog.ldc.upenn.edu/LDC97T14 |
| Fisher | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
| Voicemail part I | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
| Voicemail part II | ASR | 40 | https://catalog.ldc.upenn.edu/LDC2002S35 |
| CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | Translations with Phi-4 and MADLAD |
**Infrastructure:**
We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Training of this particular model was completed in 9 days on 32 H100 GPUs.
**Ethical Considerations and Limitations:**
Users should be aware that the model may produce unreliable outputs when decoding with `num_beams=1` or when processing extremely short audio clips (<0.1s).
Until further updates are released, we recommend using beam sizes greater than 1 and avoiding inputs below the 0.1-second threshold to ensure more consistent performance.
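As a practical guard, callers can check the clip duration and keep the beam size above 1 before decoding. Below is a minimal sketch, reusing `wav`, `sr`, `speech_granite`, and `model_inputs` from the `transformers` example above; the 0.1-second threshold comes from the guidance in this section:

```python
MIN_DURATION_S = 0.1  # threshold from the guidance above

duration_s = wav.shape[1] / sr  # wav is mono, so its shape is (1, num_samples)
if duration_s < MIN_DURATION_S:
    raise ValueError(f"Clip is {duration_s:.3f}s; inputs below {MIN_DURATION_S}s may produce unreliable output.")

# Decode with a beam size greater than 1, as recommended.
model_outputs = speech_granite.generate(**model_inputs, max_new_tokens=200, num_beams=4)
```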
The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.
To enhance safety, we recommend using granite-speech-3.3-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
**Resources**
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources