NeMo Speaker Recognition Configuration Files
============================================


This page covers NeMo configuration file setup that is specific to speaker recognition models.
For general information about how to set up and run experiments that is common to all NeMo models (e.g.,
the experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page.


The model section of a NeMo speaker recognition configuration file generally requires information about the dataset(s)
being used, the preprocessor for audio files, parameters for any augmentation being performed, and the
model architecture specification.
The sections on this page cover each of these in more detail.


Example configuration files for all of the speaker-related scripts can be found in the ``conf`` directory of
the examples: ``{NEMO_ROOT}/examples/speaker_tasks/recognition/conf``.


Dataset Configuration
---------------------


Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, the maximum
time length to consider for each audio file, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line
at runtime.

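
A manifest is a JSON-lines file with one entry per utterance. As a small sketch (the paths and speaker IDs below are made up, and the key names ``audio_filepath``, ``duration``, and ``label`` follow the usual NeMo manifest convention — verify them against your NeMo version), a minimal speaker manifest could be assembled like this:

```python
import json

# Hypothetical example entries -- the paths and speaker IDs are invented,
# and the key names (audio_filepath, duration, label) follow the common
# NeMo manifest convention; check your NeMo version's docs to confirm.
entries = [
    {"audio_filepath": "/data/spk0001/utt01.wav", "duration": 3.2, "label": "spk0001"},
    {"audio_filepath": "/data/spk0002/utt07.wav", "duration": 2.7, "label": "spk0002"},
]

# A manifest is a JSON-lines file: one JSON object per line.
manifest = "\n".join(json.dumps(e) for e in entries)
print(manifest)
```

Writing such a file and passing its path as ``manifest_filepath`` (either in the YAML or on the command line) is all the dataset section needs to locate your audio and labels.
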

Any initialization parameters that are accepted by the Dataset class used in your experiment
can be set in the config file.


An example TitaNet train and validation configuration could look like the following (``{NEMO_ROOT}/examples/speaker_tasks/recognition/conf/titanet-large.yaml``):


.. code-block:: yaml

    model:
      train_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # finds labels based on manifest file
        batch_size: 32
        trim_silence: False
        shuffle: True

      validation_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # Keep None, to match the labels extracted during training
        batch_size: 32
        shuffle: False # No need to shuffle the validation data


If you would like to use a tarred dataset, have a look at `Datasets Configuration <../configs.html#dataset-configuration>`__.


Preprocessor Configuration
--------------------------

The preprocessor computes the MFCC or mel spectrogram features that are given as inputs to the model.
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__.

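
As a rough illustration of what the framing parameters of a preprocessor control, the following back-of-the-envelope sketch computes how many feature frames a clip produces. The ``window_size``/``window_stride`` values are assumptions mirroring common 25 ms / 10 ms framing; check your preprocessor config for the actual settings.

```python
# Back-of-the-envelope framing arithmetic for a spectrogram preprocessor.
# The window_size / window_stride values are assumed 25 ms / 10 ms framing;
# check your preprocessor config for the actual settings.
sample_rate = 16000    # Hz, matching the train_ds example above
window_size = 0.025    # seconds per analysis window
window_stride = 0.01   # seconds between successive windows

def num_frames(duration_s: float) -> int:
    """Number of full analysis windows that fit into a clip of this length."""
    samples = int(duration_s * sample_rate)
    win = int(window_size * sample_rate)     # 400 samples
    hop = int(window_stride * sample_rate)   # 160 samples
    if samples < win:
        return 0
    return 1 + (samples - win) // hop

print(num_frames(3.0))  # a 3-second utterance -> 298 frames
```
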

Augmentation Configurations
---------------------------

For TitaNet training, we use on-the-fly augmentation with MUSAN noise and RIR impulses via the ``noise`` augmentor section.

The following example sets up MUSAN augmentation with audio files taken from the given manifest path and the
minimum and maximum SNR specified with ``min_snr_db`` and ``max_snr_db`` respectively. This section can be added to
the ``train_ds`` part of the model config:


.. code-block:: yaml

    model:
      ...
      train_ds:
        ...
        augmentor:
          noise:
            manifest_path: /path/to/musan/manifest_file
            prob: 0.2 # probability of augmenting the incoming batch audio with augmentor data
            min_snr_db: 5
            max_snr_db: 15


See the :class:`nemo.collections.asr.parts.preprocessing.perturb.AudioAugmentor` API section for more details.

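
To illustrate what ``min_snr_db`` and ``max_snr_db`` control, here is a simplified sketch of SNR-based noise mixing. It shows only the underlying arithmetic, not NeMo's actual perturbation code:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Illustrative only -- not NeMo's implementation."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power satisfies: p_speech / p_noise_target = 10 ** (snr_db / 10)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# With prob: 0.2, a batch is augmented 20% of the time, at an SNR drawn
# uniformly from [min_snr_db, max_snr_db] = [5, 15]. Here we fix 10 dB.
mixed = mix_at_snr([0.1, -0.2, 0.3, -0.1], [0.05, 0.04, -0.03, 0.02], 10.0)
```

Lower SNR values mean louder noise relative to speech, i.e. harder augmentation.
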

Model Architecture Configurations
---------------------------------

Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field
specifying the module to use for each.

The following sections go into more detail about the specific configurations of each model architecture.

For more information about the TitaNet encoder models, see the :doc:`Models <./models>` page.


Decoder Configurations
----------------------

After features have been computed by the TitaNet encoder, we pass them to the decoder, which computes the embeddings
and then the log probabilities used for training.


.. code-block:: yaml

    model:
      ...
      decoder:
        _target_: nemo.collections.asr.modules.SpeakerDecoder
        feat_in: *enc_feat_out
        num_classes: 7205 # total number of classes in the voxceleb1,2 training manifest file
        pool_mode: attention # xvector, attention
        emb_sizes: 192 # size of intermediate emb layers; can be comma separated for additional layers, like 512,512
        angular: true # if true, the loss is changed to angular softmax, using scale and margin from the loss section; otherwise train with cross-entropy loss

      loss:
        scale: 30
        margin: 0.2
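
When ``angular: true`` is set, ``scale`` and ``margin`` enter the loss roughly as in an additive-angular-margin (ArcFace-style) softmax. The following sketch illustrates the general technique as an assumption, not NeMo's exact loss code:

```python
import math

def angular_logit(cos_theta, is_target, scale=30.0, margin=0.2):
    """ArcFace-style logit: add the angular margin to the target-class angle,
    then multiply by the scale. A sketch of the idea, not NeMo's exact code."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if is_target:
        theta += margin
    return scale * math.cos(theta)

# The margin shrinks the target-class logit, so the embedding must sit
# closer to its class centre to score as well as before.
same = angular_logit(0.8, is_target=True)    # ~19.95
other = angular_logit(0.8, is_target=False)  # plain scaled cosine: 24.0
```

The margined logits are then fed into an ordinary softmax cross-entropy, which is why only ``scale`` and ``margin`` appear in the ``loss`` section.
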