Automatic Speech Recognition (ASR)
==================================

ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language
(speech-to-text). Our goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input.
In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with
as few errors as possible?
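
To make the metric concrete, here is a minimal sketch of computing WER with NeMo's ``word_error_rate`` helper
(assuming NeMo is installed; the reference and hypothesis strings are made-up examples):

.. code-block:: python

    from nemo.collections.asr.metrics.wer import word_error_rate

    # WER = (substitutions + insertions + deletions) / number of reference words
    references = ["the cat sat on the mat"]
    hypotheses = ["the cat sat on mat"]

    # One deleted word out of six reference words gives a WER of about 0.167.
    print(word_error_rate(hypotheses=hypotheses, references=references))
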

Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to
evaluate a speech sample. We would start with a language model that encapsulates the most likely orderings of the words to be generated
(e.g. an n-gram model), add a pronunciation model for each word in that ordering (e.g. a pronunciation table), and finish with an
acoustic model that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).

Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio
according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model ``Pr(audio|transcript)*Pr(transcript)``,
and take the argmax of this over possible transcripts.
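
Written as a formula, the decoding objective of this generative approach is to pick the transcript that maximizes the product of these
model probabilities:

.. math::

    \text{transcript}^{*} \;=\; \underset{\text{transcript}}{\arg\max}\;
    \Pr(\text{audio} \mid \text{transcript}) \, \Pr(\text{transcript})
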

Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a
neural model that had better performance and greater potential for generalization. For example, we could replace an n-gram model with a
neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural
models needs to be trained individually on a different task, and errors in any model in the pipeline could throw off the whole prediction.

Thus, we can see the appeal of end-to-end ASR architectures: discriminative models that simply take an audio input and give a textual
output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be
akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder that outputs text. If desired,
we could also integrate a language model to improve our predictions.
And the entire end-to-end ASR model can be trained at once, making it a much easier pipeline to handle!
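
As a quick illustration of this end-to-end workflow, the following minimal sketch loads a pretrained NeMo ASR checkpoint and transcribes
an audio file. The checkpoint name ``stt_en_conformer_ctc_small`` and the file name ``sample.wav`` are only placeholders you would swap
for your own model and data:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Download and restore a pretrained end-to-end ASR model
    # (the checkpoint name here is just an example).
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_small")

    # Transcribe a list of audio files (16 kHz mono WAV) and print the resulting text.
    transcriptions = asr_model.transcribe(["sample.wav"])
    print(transcriptions[0])
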

A demo below allows evaluation of NeMo ASR models in multiple languages from the browser:

.. raw:: html

    <iframe src="https://hf.space/embed/smajumdar/nemo_multilingual_language_id/+"
     width="100%" class="gradio-asr"></iframe>
    <script type="text/javascript" language="javascript">
        $('.gradio-asr').css('height', $(window).height()+'px');
    </script>

The full documentation tree is as follows:

.. toctree::
   :maxdepth: 8

   models
   datasets
   asr_language_modeling
   results
   scores
   configs
   api
   resources
   examples/kinyarwanda_asr.rst

.. include:: resources.rst