gsaon committed on
Commit
6b2b3a3
·
verified ·
1 Parent(s): 215cd0f

Update README.md

Changes for granite-speech-3.3

Files changed (1)
  1. README.md +23 -28
README.md CHANGED
@@ -9,21 +9,21 @@ library_name: transformers
9
  # Granite-speech-3.3-8b
10
 
11
  **Model Summary:**
12
- Granite-speech-3.2-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite-speech-3.2-8b uses a two-pass design, unlike integrated models that combine speech and language into a single pass. Initial calls to granite-speech-3.2-8b will transcribe audio files into text. To process the transcribed text using the underlying Granite language model, users must make a second call as each step must be explicitly initiated.
13
 
14
- The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.2 was trained by modality aligning granite-3.2-8b-instruct (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) to speech on publicly available open source corpora containing audio inputs and text targets.
15
 
16
  **Evaluations:**
17
 
18
- We evaluated granite-speech-3.2-8b alongside other speech-language models (SLMs) in the less than 8b parameter range as well as dedicated ASR and AST systems on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks while also including AST for En-X translation.
19
 
20
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/ZX49euxuzd45QcpWwp5Yz.png)
21
 
22
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/JSGpEMSTquwsAFOYBx7AZ.png)
23
 
24
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/zwpNY8J8bD46EU_ksMEb-.png)
25
 
26
- **Release Date**: April 2nd, 2025
27
 
28
  **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
29
 
@@ -39,15 +39,9 @@ Granite Speech model is in the process of being supported natively in `transform
39
 
40
  ### Usage with `transformers`
41
 
42
- First, make sure to build the latest version of transformers:
43
  ```shell
44
- pip install transformers>=4.49 peft torchaudio
45
- ```
46
-
47
- Install a `torchaudio` backend, such as:
48
-
49
- ```shell
50
- pip install soundfile
51
  ```
52
 
53
  Then run the code:
@@ -59,12 +53,12 @@ from huggingface_hub import hf_hub_download
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
 
62
- model_name = "ibm-granite/granite-speech-3.2-8b"
63
  speech_granite_processor = AutoProcessor.from_pretrained(
64
- model_name, trust_remote_code=True)
65
  tokenizer = speech_granite_processor.tokenizer
66
  speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
67
- model_name, trust_remote_code=True).to(device)
68
 
69
  # prepare speech and text prompt, using the appropriate prompt template
70
 
@@ -76,7 +70,7 @@ assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz
76
  chat = [
77
  {
78
  "role": "system",
79
- "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
80
  },
81
  {
82
  "role": "user",
@@ -145,7 +139,7 @@ using 3 trainable queries per block and per layer. The total temporal downsampli
145
  resulting in a 10Hz acoustic embeddings rate for the LLM. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the
146
  corpora mentioned under **Training Data**.
147
 
148
- (3) Large language model: granite-3.2-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct).
149
 
150
  (4) LoRA adapters: rank=64 applied to the query, value projection matrices
151
 
@@ -162,12 +156,13 @@ below:
162
  | Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
163
  | VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
164
  | AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
165
- | YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas
166
- | CommonVoice-17 En->Ja | AST | 2600 | translated with granite-3.2-8b and phi-4 |
167
- | CommonVoice-17 En->De | AST | 2600 | translated with granite-3.2-8b and phi-4 |
168
- | MLS English | other | 44000 | transcripts description provided by granite-3.1-8b |
169
- | CREMA-D | SER | 3 | https://github.com/CheyneyComputerScience/CREMA-D |
170
- | MELD | SER | 7 | https://github.com/declare-lab/MELD |
 
171
 
172
  **Infrastructure:**
173
  We train Granite Speech using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable
@@ -175,9 +170,9 @@ and efficient infrastructure for training our models over thousands of GPUs. The
175
  H100 GPUs.
176
 
177
  **Ethical Considerations and Limitations:**
178
- The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.2-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.
179
 
180
- To enhance safety, we recommend using granite-speech-3.2-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
181
 
182
  **Resources**
183
  - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
 
9
  # Granite-speech-3.3-8b
10
 
11
  **Model Summary:**
12
+ Granite-speech-3.3-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite-speech-3.3-8b uses a two-pass design, unlike integrated models that combine speech and language into a single pass. Initial calls to granite-speech-3.3-8b will transcribe audio files into text. To process the transcribed text using the underlying Granite language model, users must make a second call as each step must be explicitly initiated.
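The two-pass design described in this paragraph can be sketched as two explicit calls. This is only an illustration of the control flow: `two_pass`, `transcribe`, and `respond` are hypothetical names that do not appear in the README, and the stub callables stand in for real model calls so the sketch runs without downloading anything.

```python
from typing import Callable, Tuple


def two_pass(
    audio_path: str,
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
) -> Tuple[str, str]:
    """Run the two steps one after the other.

    Both callables are hypothetical stand-ins: `transcribe` would wrap a
    speech call to the model, and `respond` a text-only call to the
    underlying language model.
    """
    transcript = transcribe(audio_path)  # pass 1: audio -> text
    answer = respond(transcript)         # pass 2: text -> text, started explicitly
    return transcript, answer


# Stub callables (assumptions for illustration only).
def fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


def fake_respond(text: str) -> str:
    return f"summary: {text}"


transcript, answer = two_pass("sample.wav", fake_transcribe, fake_respond)
print(transcript)
print(answer)
```

Keeping the two passes as separate calls mirrors the statement that each step must be explicitly initiated: nothing in the first call triggers the second.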
13
 
14
+ The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.3 was trained by modality aligning granite-3.3-8b-instruct (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) to speech on publicly available open source corpora containing audio inputs and text targets.
15
 
16
  **Evaluations:**
17
 
18
+ We evaluated granite-speech-3.3-8b alongside other speech-language models (SLMs) in the less than 8b parameter range as well as dedicated ASR and AST systems on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks while also including AST for En-X translation.
19
 
20
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/Q1cl571NmTi9-c8vRVJB2.png)
21
 
22
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/L3ply0GAbPWZn36UZAjeL.png)
23
 
24
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/o9iK_6cZEIw1u5UAYf1YB.png)
25
 
26
+ **Release Date**: April 15, 2025
27
 
28
  **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
29
 
 
39
 
40
  ### Usage with `transformers`
41
 
42
+ First, make sure to build the latest version of transformers from source:
43
  ```shell
44
+ pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
 
 
 
 
 
 
45
  ```
46
 
47
  Then run the code:
 
53
 
54
  device = "cuda" if torch.cuda.is_available() else "cpu"
55
 
56
+ model_name = "ibm-granite/granite-speech-3.3-8b"
57
  speech_granite_processor = AutoProcessor.from_pretrained(
58
+ model_name)
59
  tokenizer = speech_granite_processor.tokenizer
60
  speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
61
+ model_name).to(device)
62
 
63
  # prepare speech and text prompt, using the appropriate prompt template
64
 
 
70
  chat = [
71
  {
72
  "role": "system",
73
+ "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
74
  },
75
  {
76
  "role": "user",
 
139
  resulting in a 10Hz acoustic embeddings rate for the LLM. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the
140
  corpora mentioned under **Training Data**.
141
 
142
+ (3) Large language model: granite-3.3-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct).
143
 
144
  (4) LoRA adapters: rank=64 applied to the query, value projection matrices
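As a rough illustration of the 10Hz acoustic embedding rate mentioned above, the sketch below estimates how many embeddings reach the LLM for a clip of a given length. The 16 kHz sample rate comes from the usage snippet's assertion. The helper name `num_acoustic_embeddings` and the round-up behaviour are assumptions, since the diff does not say how partial windows are padded.

```python
EMBEDDING_RATE_HZ = 10   # acoustic embeddings per second, as stated in the README
SAMPLE_RATE_HZ = 16000   # mono 16 kHz input, per the usage snippet


def num_acoustic_embeddings(num_samples: int,
                            sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Estimate the number of acoustic embeddings for a waveform.

    Rounds up to a whole embedding (an assumption; the real model's
    padding may differ). Integer arithmetic avoids float rounding.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    # Ceiling division of (num_samples * rate) by sample_rate.
    return -(-num_samples * EMBEDDING_RATE_HZ // sample_rate)


thirty_seconds = 30 * SAMPLE_RATE_HZ
print(num_acoustic_embeddings(thirty_seconds))  # 300 embeddings for a 30 s clip
```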
145
 
 
156
  | Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
157
  | VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
158
  | AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
159
+ | YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
160
+ | Switchboard English | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
161
+ | CallHome English | ASR | 18 | https://catalog.ldc.upenn.edu/LDC97T14 |
162
+ | Fisher | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
163
+ | Voicemail part I | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
164
+ | Voicemail part II | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
165
+ | CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | ensemble translation with granite-3.2-8b, phi-4 and MADLAD |
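To make the `2600*7` entry concrete, the sketch below totals the rows visible in this hunk of the new table. It assumes the third column is a duration in hours (the column header is outside the hunk) and covers only the rows shown here, so the totals are not the full training-set size.

```python
# Rows visible in this hunk of the updated table: (name, task, amount).
# The amount is assumed to be hours; rows above the hunk are not included.
rows = [
    ("Librispeech", "ASR", 1000),
    ("VoxPopuli English", "ASR", 500),
    ("AMI", "ASR", 100),
    ("YODAS English", "ASR", 10000),
    ("Switchboard English", "ASR", 260),
    ("CallHome English", "ASR", 18),
    ("Fisher", "ASR", 2000),
    ("Voicemail part I", "ASR", 40),
    ("Voicemail part II", "ASR", 40),
    # Seven target languages (De, Es, Fr, It, Ja, Pt, Zh) at 2600 each.
    ("CommonVoice-17 En->X", "AST", 2600 * 7),
]

totals = {}
for _name, task, amount in rows:
    totals[task] = totals.get(task, 0) + amount

print(totals["AST"])         # 18200
print(totals["ASR"])         # 13958
print(sum(totals.values()))  # 32158
```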
166
 
167
  **Infrastructure:**
168
  We train Granite Speech using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable
 
170
  H100 GPUs.
171
 
172
  **Ethical Considerations and Limitations:**
173
+ The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.
174
 
175
+ To enhance safety, we recommend using granite-speech-3.3-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
176
 
177
  **Resources**
178
  - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite