@@ -9,21 +9,21 @@ library_name: transformers
 # Granite-speech-3.3-8b
 **Model Summary:**
-Granite-speech-3.2-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite-speech-3.2-8b uses a two-pass design, unlike integrated models that combine speech and language into a single pass. Initial calls to granite-speech-3.2-8b will transcribe audio files into text. To process the transcribed text using the underlying Granite language model, users must make a second call as each step must be explicitly initiated.
-The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.2 was trained by modality aligning granite-3.2-8b-instruct (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) to speech on publicly available open source corpora containing audio inputs and text targets.
 **Evaluations:**
-We evaluated granite-speech-3.2-8b alongside other speech-language models (SLMs) in the less than 8b parameter range as well as dedicated ASR and AST systems on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks while also including AST for En-X translation.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/ZX49euxuzd45QcpWwp5Yz.png)
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/JSGpEMSTquwsAFOYBx7AZ.png)
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/zwpNY8J8bD46EU_ksMEb-.png)
-**Release Date**: April 2nd, 2025
 **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -39,15 +39,9 @@ Granite Speech model is in the process of being supported natively in `transform
 ### Usage with `transformers`
-First, make sure to build the latest version of transformers:
 ```shell
-pip install transformers>=4.49 peft torchaudio
-```
-Install a `torchaudio` backend, such as:
-```shell
-pip install soundfile
 ```
 Then run the code:
@@ -59,12 +53,12 @@ from huggingface_hub import hf_hub_download
 device = "cuda" if torch.cuda.is_available() else "cpu"
-model_name = "ibm-granite/granite-speech-3.2-8b"
 speech_granite_processor = AutoProcessor.from_pretrained(
-    model_name, trust_remote_code=True)
 tokenizer = speech_granite_processor.tokenizer
 speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_name, trust_remote_code=True).to(device)
 # prepare speech and text prompt, using the appropriate prompt template
@@ -76,7 +70,7 @@ assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz
 chat = [
     {
         "role": "system",
-        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
     },
     {
         "role": "user",
@@ -145,7 +139,7 @@ using 3 trainable queries per block and per layer. The total temporal downsampli
 resulting in a 10Hz acoustic embeddings rate for the LLM. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the
 corpora mentioned under **Training Data**.
-(3) Large language model: granite-3.2-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.2-8b-instruct).
 (4) LoRA adapters: rank=64 applied to the query, value projection matrices
@@ -162,12 +156,13 @@ below:
 | Librispeech             | ASR | 1000 |  https://huggingface.co/datasets/openslr/librispeech_asr |
 | VoxPopuli English       | ASR | 500 |  https://huggingface.co/datasets/facebook/voxpopuli |
 | AMI                     | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
-| YODAS English           | ASR | 10000 |  https://huggingface.co/datasets/espnet/yodas
-| CommonVoice-17 En->Ja   | AST | 2600 | translated with granite-3.2-8b and phi-4 |
-| CommonVoice-17 En->De   | AST | 2600 | translated with granite-3.2-8b and phi-4 |
-| MLS English             | other | 44000 | transcripts description provided by granite-3.1-8b |
-| CREMA-D                 | SER | 3 | https://github.com/CheyneyComputerScience/CREMA-D |
-| MELD                    | SER | 7 | https://github.com/declare-lab/MELD |
 **Infrastructure:**
 We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable
@@ -175,9 +170,9 @@ and efficient infrastructure for training our models over thousands of GPUs. The
 H100 GPUs.
 **Ethical Considerations and Limitations:**
-The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.2-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.
-To enhance safety, we recommend using granite-speech-3.2-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
 **Resources**
 - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite

 # Granite-speech-3.3-8b
 **Model Summary:**
+Granite-speech-3.3-8b is a compact and efficient speech-language model, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite-speech-3.3-8b uses a two-pass design, unlike integrated models that combine speech and language into a single pass. The first call to granite-speech-3.3-8b transcribes an audio file into text. To process the transcript with the underlying Granite language model, users must make a second call, because each step must be initiated explicitly.
+The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST, as well as synthetic datasets tailored to support the speech translation task. Granite-speech-3.3 was trained by modality-aligning granite-3.3-8b-instruct (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) to speech on publicly available open-source corpora containing audio inputs and text targets.
 **Evaluations:**
+We evaluated granite-speech-3.3-8b alongside other speech-language models (SLMs) in the less than 8b parameter range as well as dedicated ASR and AST systems on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks while also including AST for En-X translation.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/Q1cl571NmTi9-c8vRVJB2.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/L3ply0GAbPWZn36UZAjeL.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/666ec38102791b3b49f453e8/o9iK_6cZEIw1u5UAYf1YB.png)
+**Release Date**: April 15, 2025
 **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 ### Usage with `transformers`
+First, make sure to build the latest version of transformers from source:
 ```shell
+pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
 ```
 Then run the code:
 device = "cuda" if torch.cuda.is_available() else "cpu"
+model_name = "ibm-granite/granite-speech-3.3-8b"
 speech_granite_processor = AutoProcessor.from_pretrained(
+    model_name)
 tokenizer = speech_granite_processor.tokenizer
 speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
+    model_name).to(device)
 # prepare speech and text prompt, using the appropriate prompt template
 chat = [
     {
         "role": "system",
+        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
     },
     {
         "role": "user",
 resulting in a 10Hz acoustic embeddings rate for the LLM. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the
 corpora mentioned under **Training Data**.
+(3) Large language model: granite-3.3-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct).
 (4) LoRA adapters: rank=64 applied to the query, value projection matrices
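
As a rough illustration of the 10Hz acoustic embeddings rate mentioned in the architecture notes, the sketch below estimates how many acoustic embeddings the LLM would receive for a clip of a given length. The 16 kHz sample rate comes from the usage snippet's assertion; the helper name `acoustic_embedding_count` is hypothetical, and rounding up at the clip boundary is an assumption, since the card does not state how partial blocks are padded.

```python
import math

EMBEDDING_RATE_HZ = 10    # from the card: 10Hz acoustic embeddings rate for the LLM
SAMPLE_RATE_HZ = 16_000   # from the usage snippet: mono, 16khz input


def acoustic_embedding_count(num_samples: int,
                             sample_rate: int = SAMPLE_RATE_HZ,
                             rate_hz: int = EMBEDDING_RATE_HZ) -> int:
    """Estimate the number of acoustic embeddings passed to the LLM.

    Hypothetical helper: rounds up, assuming a trailing partial block
    still yields an embedding (not confirmed by the model card).
    """
    duration_s = num_samples / sample_rate
    return math.ceil(duration_s * rate_hz)


# A 30-second clip at 16 kHz has 480,000 samples.
thirty_seconds = acoustic_embedding_count(30 * SAMPLE_RATE_HZ)
```

Under these assumptions, audio occupies roughly ten positions of the LLM context per second, which is small compared with the 128k context length listed for the language model.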
 | Librispeech             | ASR | 1000 |  https://huggingface.co/datasets/openslr/librispeech_asr |
 | VoxPopuli English       | ASR | 500 |  https://huggingface.co/datasets/facebook/voxpopuli |
 | AMI                     | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
+| YODAS English           | ASR | 10000 |  https://huggingface.co/datasets/espnet/yodas |
+| Switchboard English     | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
+| CallHome English        | ASR | 18  | https://catalog.ldc.upenn.edu/LDC97T14 |
+| Fisher                  | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
+| Voicemail part I        | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
+| Voicemail part II       | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
+| CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | ensemble translation with granite-3.2-8b, phi-4 and MADLAD |
 **Infrastructure:**
 We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable
 H100 GPUs.
 **Ethical Considerations and Limitations:**
+The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.
+To enhance safety, we recommend using granite-speech-3.3-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
 **Resources**
 - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
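
The two-pass design described in the Model Summary can be sketched as plain control flow: one call produces a transcript, and a second, separately initiated call sends that transcript to the language model. This is only a minimal sketch. The names `two_pass`, `transcribe`, and `respond` are hypothetical placeholders for whatever functions wrap the speech call and the text-only call in a given application; they are not part of the `transformers` API.

```python
from typing import Callable, Tuple


def two_pass(audio_path: str,
             transcribe: Callable[[str], str],
             respond: Callable[[str], str]) -> Tuple[str, str]:
    """Run the two explicitly initiated steps in sequence.

    transcribe: hypothetical wrapper around the speech call (audio -> text).
    respond:    hypothetical wrapper around the text-only LLM call.
    """
    # Pass 1: audio goes in, a transcript comes out.
    transcript = transcribe(audio_path)
    # Pass 2: only the transcript text reaches the language model.
    answer = respond(transcript)
    return transcript, answer


# Stand-in callables so the flow can be exercised without loading the model.
demo_transcript, demo_answer = two_pass(
    "sample.wav",
    transcribe=lambda path: "hello world",
    respond=lambda text: text.upper(),
)
```

Keeping the two steps as separate callables makes it straightforward to inspect or filter the transcript, for example with Granite Guardian, before the second call is made.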