diarray committed
Commit db0f9ff · verified · 1 Parent(s): f891829

Push model using huggingface_hub.

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +141 -0
  3. soloni-114m-tdt-ctc.nemo +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+soloni-114m-tdt-ctc.nemo filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,141 @@
---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/bam-asr-all
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-tdt_ctc-110m
model-index:
- name: soloni-114m-tdt-ctc
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: bam-asr-all
      type: RobotsMali/bam-asr-all
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER (TDT)
      type: wer
      value: 66.7
    - name: Test WER (CTC)
      type: wer
      value: 40.6
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Soloni TDT-CTC 114M Bambara

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--TDT-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-114M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-bm-lightgrey#model-badge)](#datasets)

`soloni-114m-tdt-ctc` is a fine-tuned version of NVIDIA's [`parakeet-tdt_ctc-110m`](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m) that transcribes speech in the Bambara language. Unlike its base model, this model does not produce punctuation or capitalization, since neither was present in its training data.
The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.

## NVIDIA NeMo: Training

To fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest PyTorch version.

```bash
pip install nemo_toolkit['asr']
```

## How to Use This Model

Note that this model has been released primarily for research purposes.

### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-114m-tdt-ctc")
```

### Transcribe Audio
```python
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```
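
Depending on your NeMo version, `transcribe()` may return a list of plain strings or a list of hypothesis objects. A minimal sketch for extracting the text that handles both cases (the file name is a placeholder):

```python
# Hedged sketch: get the transcript text regardless of the return type.
outputs = asr_model.transcribe(['sample_audio.wav'])
first = outputs[0]
text = first.text if hasattr(first, "text") else first
print(text)
```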

### Input

This model accepts **16,000 Hz mono-channel** audio (WAV files) as input. Recordings in other formats or sample rates should be converted first, as in the sketch below.

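A minimal conversion sketch, assuming the `librosa` and `soundfile` packages are installed (the file names are placeholders, not files in this repository):

```python
# Hedged sketch: resample arbitrary audio to 16 kHz mono WAV for this model.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # decode, downmix, resample
sf.write("sample_audio.wav", audio, 16000)                  # write 16 kHz mono WAV
```
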
### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

This model uses a Hybrid FastConformer-TDT-CTC architecture. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

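As a quick sanity check on the model size, you can count the parameters of the loaded model with plain PyTorch (a sketch; no NeMo-specific API involved):

```python
# Count parameters of the loaded model; roughly 114M expected.
num_params = sum(p.numel() for p in asr_model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")
```
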
## Training

The NeMo toolkit was used to fine-tune this model for **16,296 steps** from the `parakeet-tdt_ctc-110m` checkpoint. The model was trained with this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/configs/parakeet-110m-config-v6.yaml). The full training configurations, scripts, and experimental logs are available here:
🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr)

The tokenizer was built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

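The resolved training configuration also ships inside the `.nemo` checkpoint itself; a sketch for inspecting it (standard NeMo/OmegaConf usage, with `asr_model` loaded as above):

```python
# Hedged sketch: dump the config stored in the checkpoint as YAML.
from omegaconf import OmegaConf

print(OmegaConf.to_yaml(asr_model.cfg))
```
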
## Dataset
This model was fine-tuned on the [bam-asr-all](https://huggingface.co/datasets/RobotsMali/bam-asr-all) dataset, which consists of 37 hours of transcribed Bambara speech. The dataset is primarily derived from the **Jeli-ASR** dataset (~87%).

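A sketch for pulling the evaluation data from the Hub with the `datasets` library (the split name follows the model-index metadata above; check the dataset card for the exact configuration):

```python
# Hedged sketch: load the test split of bam-asr-all.
from datasets import load_dataset

bam_test = load_dataset("RobotsMali/bam-asr-all", split="test")
print(bam_test)
```
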
## Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER). Since this model has two decoders operating independently, each decoder is also evaluated independently.

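For reference, WER counts substitutions (S), deletions (D), and insertions (I) against the number of words in the reference (N):

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100$$
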
The following table summarizes the performance of this model with each decoder branch, reported in terms of **Word Error Rate (WER%)** on the bam-asr-all test set.

|**Decoder (Version)**|**Tokenizer**|**Vocabulary Size**|**bam-asr-all (test set)**|
|---------|-----------------------|-----------------|---------|
| CTC (V6) | BPE | 1024 | 40.6 |
| TDT (V6) | BPE | 1024 | 66.7 |

These are greedy WER numbers without an external LM. By default, the main decoder branch is the TDT branch; if you would like to switch to the CTC decoder, simply run this block of code before calling the `.transcribe` method:

```python
# Retrieve the CTC decoding config
ctc_decoding_cfg = asr_model.cfg.aux_ctc.decoding
# Then change the decoding strategy
asr_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)
# Transcribe with the CTC decoder
asr_model.transcribe(['sample_audio.wav'])
```
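
To return to the default transducer branch afterwards, a one-line sketch (in NeMo's hybrid models, `decoder_type='rnnt'` selects the TDT/transducer decoder):

```python
# Hedged sketch: restore the default TDT (transducer) decoding branch.
asr_model.change_decoding_strategy(decoder_type='rnnt')
```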

## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

More details are available in the **Experimental Technical Report**:
📄 [Draft Technical Report - Weights & Biases](https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1--VmlldzoxMTIyOTMzOA)

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on GitHub if you have any contributions.

---
soloni-114m-tdt-ctc.nemo ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67d28e9b2900ffc5751181ea783a0932b5805cf7690ef14d0019fd3e524dedb0
size 459243520