---
license: unknown
language:
- en
metrics:
- wer
tags:
- whisper
- speech processing
- nlp
- asr
- domain adaptation
---
# Whispered TIA
Whispered TIA is a fine-tuned ASR model based on Whisper. It is adapted to the Siemens AG software
<a href="https://www.siemens.com/de/de/produkte/automatisierung/industrie-software/automatisierungs-software/tia-portal.html">TIA (Totally Integrated Automation)</a> and is able to recognize domain-specific words and transcribe them correctly.
# Base Model Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
# Training Results
The False HallucER indicates how many hallucinations and deletions were produced.
<!DOCTYPE html>
<html>
<head>
<style>
table {
width: 100%;
border-collapse: collapse;
}
th, td {
padding: 8px;
text-align: left;
border-bottom: 1px solid #ddd;
}
th {
background-color: #f2f2f2;
}
</style>
</head>
<body>
<table>
<tr>
<th>WER</th>
<th>False HallucER</th>
<th>Runtime</th>
<th>Batch Size</th>
<th>Memory Usage</th>
</tr>
<tr>
<td>1.6</td>
<td>499.76</td>
<td>1.72</td>
<td>64</td>
<td>20049</td>
</tr>
<tr>
<td>~</td>
<td>Predictions &gt; References: 34%</td>
<td>~</td>
<td>~</td>
<td>~</td>
</tr>
<tr>
<td>~</td>
<td>Predictions &lt; References: 30%</td>
<td>~</td>
<td>~</td>
<td>~</td>
</tr>
<tr>
<td>~</td>
<td>Predictions = References: 35%</td>
<td>~</td>
<td>~</td>
<td>~</td>
</tr>
</table>
</body>
</html>
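
For readers who want to reproduce this kind of evaluation, the following is a minimal sketch of how a word error rate and a length comparison between predictions and references could be computed. It is not the evaluation code used for this model. The function names `word_error_rate` and `length_breakdown` are hypothetical, the example sentences are made up, and reading the "Predictions > References" rows as a comparison of word counts is an assumption, since the card does not define them.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra predicted word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def length_breakdown(references, predictions):
    """Share of pairs whose prediction has more, fewer, or equally many words.

    Assumption: this is one possible reading of the
    "Predictions > / < / = References" rows, not a confirmed definition.
    """
    if not references or len(references) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")
    longer = shorter = equal = 0
    for ref, hyp in zip(references, predictions):
        n_ref, n_hyp = len(ref.split()), len(hyp.split())
        if n_hyp > n_ref:
            longer += 1
        elif n_hyp < n_ref:
            shorter += 1
        else:
            equal += 1
    total = len(references)
    return {
        "longer": longer / total,
        "shorter": shorter / total,
        "equal": equal / total,
    }


# Made-up example sentences, for illustration only
refs = ["open the tia portal", "add a new device", "start"]
preds = ["open tia portal", "add a new device", "start now"]
scores = [word_error_rate(r, p) for r, p in zip(refs, preds)]
shares = length_breakdown(refs, preds)
```

In practice a library such as `jiwer` or the `evaluate` package is likely a better choice for WER, because it also handles text normalization; the hand-written version only shows the underlying computation.
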
# Dataset
For more information on the underlying dataset, see <a href="https://huggingface.co/datasets/masters-thesis-vm/whispered_TIA_normal">dataset: normal</a>.
# Inference
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
# Path to the audio file to transcribe
file = "/path/to/audio"
# Load the audio and resample it to 16 kHz, the sampling rate Whisper expects
arr, sampling_rate = librosa.load(file, sr=16000)
# Load whisper model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("masters-thesis-vm/whispered_TIA_small_ad_tokenization_encoder_freezing_normal")
# Preprocessing: convert the waveform to log-Mel spectrogram input features
input_features = processor(arr, return_tensors="pt", sampling_rate=sampling_rate).input_features
# Prediction
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
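
Whisper's feature extractor works on 30-second windows, so the snippet above is best suited to short recordings. For longer files, one simple option is to split the waveform into consecutive chunks and transcribe each chunk separately. The helper below is a minimal sketch of that idea; the name `split_into_chunks` and its parameters are hypothetical and not part of this repository or of `transformers`.

```python
def split_into_chunks(samples, sampling_rate, chunk_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive chunks.

    Works on anything that supports len() and slicing, such as the NumPy
    array returned by librosa.load or a plain Python list.
    """
    chunk_size = int(sampling_rate * chunk_seconds)
    if chunk_size < 1:
        raise ValueError("sampling_rate * chunk_seconds must be at least 1")
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]


# Illustration with 65 seconds of silence at 16 kHz (made-up input)
dummy_audio = [0.0] * (16000 * 65)
chunks = split_into_chunks(dummy_audio, sampling_rate=16000)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Each chunk can then be passed through `processor` and `model.generate` in the same way as `arr` above, and the resulting texts joined. Hard cuts at fixed positions may split a word in two, so overlapping chunks or cutting at pauses may give better results.
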