gsaon committed on
Commit
8cdf7d4
·
verified ·
1 Parent(s): 9096618

Update README.md

  1. README.md +182 -3
README.md CHANGED
@@ -1,3 +1,182 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - ibm-granite/granite-3.2-8b-instruct
7
+ library_name: transformers
8
+ ---
9
+ # Granite-speech-3.2-8b
10
+
11
+ **Model Summary:**
12
+ Granite-speech-3.2-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.2 was trained by LoRA fine-tuning granite-3.2-8b-instruct (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) on publicly available open source corpora containing audio inputs and text targets.
13
+
14
+ **Evaluations:**
15
+
16
+ We evaluated granite-speech-3.2-8b alongside other speech-language models (SLMs) with fewer than 8B parameters, as well as dedicated ASR and AST systems, on multiple public benchmarks. The evaluation placed particular emphasis on English ASR tasks while also including AST for En-X translation.
17
+
18
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/2VepW7w-VPcBGKagdpKJK.png)
19
+
20
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/r44KxMuUxybDXWHkx8FAu.png)
21
+
22
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/owRC0DuMGJ9nNq7xUazHc.png)
23
+
24
+ **Release Date**: April 2nd, 2025
25
+
26
+ **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
27
+
28
+ **Supported Languages:**
29
+ English
30
+
31
+ **Intended Use:**
32
+ The model is intended for enterprise applications that process speech inputs. In particular, it is well suited to English speech-to-text and to speech translation from English into several major European languages, such as French, Spanish, Italian, German, and Portuguese, as well as Japanese and Mandarin. For tasks with text-only input, we suggest using our Granite large language models, which are optimized for text-only processing and outperform this model on such tasks.
33
+
34
+ ## Generation:
35
+
36
+ Native support for the Granite Speech model in `transformers` is in progress on the `main` branch. In the interim, below is a simple example of how to use the `granite-speech-3.2-8b` model.
37
+
38
+ ### Usage with `transformers`
39
+
40
+ First, make sure a recent version of `transformers` (4.49 or later) is installed:
41
+ ```shell
42
+ pip install "transformers>=4.49"
43
+ ```
44
+
45
+ Then run the code:
46
+ ```python
47
+ import torch
48
+ import torchaudio
49
+ from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
50
+ from huggingface_hub import hf_hub_download
51
+
52
+ device = "cuda" if torch.cuda.is_available() else "cpu"
53
+
54
+ model_name = "ibm-granite/granite-speech-3.2-8b"
55
+ speech_granite_processor = AutoProcessor.from_pretrained(
56
+     model_name, trust_remote_code=True)
57
+ tokenizer = speech_granite_processor.tokenizer
58
+ speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
59
+     model_name, trust_remote_code=True).to(device)
60
+
61
+ # prepare speech and text prompt, using the appropriate prompt template
62
+
63
+ audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
64
+ wav, sr = torchaudio.load(audio_path, normalize=True)
65
+ assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz
66
+
67
+ # create text prompt
68
+ chat = [
69
+     {
70
+         "role": "system",
71
+         "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
72
+     },
73
+     {
74
+         "role": "user",
75
+         "content": "<|audio|>can you transcribe the speech into a written format?",
76
+     }
77
+ ]
78
+
79
+ text = tokenizer.apply_chat_template(
80
+     chat, tokenize=False, add_generation_prompt=True
81
+ )
82
+
83
+ # compute audio embeddings
84
+ model_inputs = speech_granite_processor(
85
+     text,
86
+     wav,
87
+     device=device,  # Computation device; returned tensors are put on CPU
88
+     return_tensors="pt",
89
+ ).to(device)
90
+
91
+ model_outputs = speech_granite.generate(
92
+     **model_inputs,
93
+     max_new_tokens=1000,
94
+     num_beams=4,
95
+     do_sample=False,
96
+     min_length=1,
97
+     top_p=1.0,
98
+     repetition_penalty=3.0,
99
+     length_penalty=1.0,
100
+     temperature=1.0,
101
+     bos_token_id=tokenizer.bos_token_id,
102
+     eos_token_id=tokenizer.eos_token_id,
103
+     pad_token_id=tokenizer.pad_token_id,
104
+ )
105
+
106
+ # Transformers includes the input IDs in the response.
107
+ num_input_tokens = model_inputs["input_ids"].shape[-1]
108
+ new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
109
+
110
+ output_text = tokenizer.batch_decode(
111
+     new_tokens, add_special_tokens=False, skip_special_tokens=True
112
+ )
113
+ print(f"STT output = {output_text[0].upper()}")
114
+ ```
115
+
116
+ **Model Architecture:**
117
+
118
+ The architecture of granite-speech-3.2-8b consists of the following components:
119
+
120
+ (1) Speech encoder: 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets on the subset containing
121
+ only ASR corpora (see configuration below). In addition, our CTC encoder uses block-attention with 4-second audio blocks and self-conditioned CTC
122
+ from the middle layer.
123
+
124
+ | Configuration parameter | Value |
125
+ |-----------------|----------------------|
126
+ | Input dimension | 160 (80 logmels x 2) |
127
+ | Nb. of layers | 10 |
128
+ | Hidden dimension | 1024 |
129
+ | Nb. of attention heads | 8 |
130
+ | Attention head size | 128 |
131
+ | Convolution kernel size | 15 |
132
+ | Output dimension | 42 |
133
+
134
+ (2) Speech projector and temporal downsampler (speech-text modality adapter): we use a 2-layer window query transformer (q-former) operating on
135
+ blocks of 15 1024-dimensional acoustic embeddings coming out of the last conformer block of the speech encoder that get downsampled by a factor of 5
136
+ using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector)
137
+ resulting in a 10 Hz acoustic embedding rate for the LLM. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the
138
+ corpora mentioned under **Training Data**.
139
+
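The downsampling arithmetic in (2) can be sketched in a few lines of Python. The sketch assumes a 100 Hz log-mel frame rate (one frame every 10 ms), which is implied by the stated 10x total downsampling and 10 Hz output rate, and it assumes partial blocks are padded up to a full block; the actual padding behavior of the model may differ. The function name is hypothetical.

```python
import math

FRAME_RATE_HZ = 100  # assumed: one log-mel frame every 10 ms
STACK = 2            # 80 log-mels x 2 stacked frames -> 160-dim encoder input
WINDOW = 15          # acoustic embeddings per q-former block
QUERIES = 3          # trainable queries per block (15 / 3 = 5x downsampling)


def num_audio_embeddings(duration_s: float) -> int:
    """Estimate how many audio embeddings reach the LLM for a clip."""
    frames = round(duration_s * FRAME_RATE_HZ)
    encoder_steps = math.ceil(frames / STACK)   # 2x from the encoder input stacking
    blocks = math.ceil(encoder_steps / WINDOW)  # assumed: last block is padded
    return blocks * QUERIES


print(num_audio_embeddings(30.0))  # 30 s of audio at roughly 10 embeddings per second
```

Under these assumptions a 30-second clip occupies about 300 positions in the LLM context, which is small relative to the 128k context length of the base model.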
140
+ (3) Large language model: granite-3.2-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct).
141
+
142
+ (4) LoRA adapters: rank 64, applied to the query and value projection matrices.
143
+
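To give a sense of how small the rank-64 adapters in (4) are, the following sketch counts the extra parameters LoRA adds to one projection matrix. LoRA replaces the update to a `d_out x d_in` weight with the product of a `d_out x r` and an `r x d_in` matrix, so it adds `r * (d_in + d_out)` parameters. The matrix shapes in the example call are illustrative assumptions, not values taken from this README.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int = 64) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative shapes only (assumed, not from the README):
# a square query projection and a narrower value projection.
q_params = lora_extra_params(d_in=4096, d_out=4096)
v_params = lora_extra_params(d_in=4096, d_out=1024)
print(q_params + v_params)  # extra trainable parameters per transformer layer
```

For comparison, a full 4096 x 4096 matrix has about 16.8 million parameters, so the rank-64 adapter on it is roughly 3% of that size.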
144
+ **Training Data:**
145
+
146
+ Overall, our training data comes largely from two key sources: (1) publicly available datasets and (2) synthetic data created from publicly
147
+ available datasets specifically targeting the speech translation task. A detailed description of the training datasets can be found in the table
148
+ below:
149
+
150
+ | Name | Task | Nb. hours | Source |
151
+ |-----------|--------------|----------------|--------------|
152
+ | CommonVoice-17 English | ASR | 2600 | https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 |
153
+ | MLS English | ASR | 44000 | https://huggingface.co/datasets/facebook/multilingual_librispeech |
154
+ | Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
155
+ | VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
156
+ | AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
157
+ | YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
158
+ | CommonVoice-17 En->Ja | AST | 2600 | translated with granite-3.2-8b and phi-4 |
159
+ | CommonVoice-17 En->De | AST | 2600 | translated with granite-3.2-8b and phi-4 |
160
+ | MLS English | other | 44000 | transcripts description provided by granite-3.1-8b |
161
+ | CREMA-D | SER | 3 | https://github.com/CheyneyComputerScience/CREMA-D |
162
+ | MELD | SER | 7 | https://github.com/declare-lab/MELD |
163
+
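The per-task totals implied by the table can be tallied with a short script. The rows are copied from the table above; note that the same audio is reused across tasks (for example, MLS English appears under both ASR and "other"), so these are hours of training examples per task rather than hours of unique audio.

```python
from collections import defaultdict

# (name, task, hours) copied from the training data table
ROWS = [
    ("CommonVoice-17 English", "ASR", 2600),
    ("MLS English", "ASR", 44000),
    ("Librispeech", "ASR", 1000),
    ("VoxPopuli English", "ASR", 500),
    ("AMI", "ASR", 100),
    ("YODAS English", "ASR", 10000),
    ("CommonVoice-17 En->Ja", "AST", 2600),
    ("CommonVoice-17 En->De", "AST", 2600),
    ("MLS English", "other", 44000),
    ("CREMA-D", "SER", 3),
    ("MELD", "SER", 7),
]


def hours_by_task(rows):
    """Sum the 'Nb. hours' column per task."""
    totals = defaultdict(int)
    for _name, task, hours in rows:
        totals[task] += hours
    return dict(totals)


totals = hours_by_task(ROWS)
for task, hours in totals.items():
    print(f"{task}: {hours} h")
```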
164
+ **Infrastructure:**
165
+ We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable
166
+ and efficient infrastructure for training our models over thousands of GPUs. The training of this particular model was completed in 10 days on 32
167
+ H100 GPUs.
168
+
169
+ **Ethical Considerations and Limitations:**
170
+ The use of large speech and language models may involve risks and ethical considerations that people should
171
+ be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use
172
+ granite-speech-3.2-8b in a manner consistent with IBM’s Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic
173
+ speech recognition tasks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs. To enhance safety, we
174
+ recommend using granite-speech-3.2-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks
175
+ in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic
176
+ data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of
177
+ safety.
178
+
179
+ **Resources**
180
+ - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
181
+ - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
182
+ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources