NeMo Speaker Recognition Configuration Files
============================================


This page covers NeMo configuration file setup that is specific to speaker recognition models.
For general information about how to set up and run experiments that is common to all NeMo models (e.g.,
the experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page.


The model section of a NeMo speaker recognition configuration file generally requires information about the dataset(s)
being used, the preprocessor for audio files, parameters for any augmentation being performed, and the
model architecture specification.
The sections on this page cover each of these in more detail.


Example configuration files for all of the speaker-related scripts can be found in the ``conf`` directory of
the examples: ``{NEMO_ROOT}/examples/speaker_tasks/recognition/conf``.


Dataset Configuration
---------------------


Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, the maximum
time length to consider for each audio file, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line
at runtime.

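
A manifest is a JSON-lines file with one entry per utterance. As a small sketch (the paths and speaker IDs below are made up, and the key names ``audio_filepath``, ``duration``, and ``label`` follow the usual NeMo manifest convention — verify them against your NeMo version), a minimal speaker manifest could be assembled like this:

```python
import json

# Hypothetical example entries -- the paths and speaker IDs are invented,
# and the key names (audio_filepath, duration, label) follow the common
# NeMo manifest convention; check your NeMo version's docs to confirm.
entries = [
    {"audio_filepath": "/data/spk0001/utt01.wav", "duration": 3.2, "label": "spk0001"},
    {"audio_filepath": "/data/spk0002/utt07.wav", "duration": 2.7, "label": "spk0002"},
]

# A manifest is a JSON-lines file: one JSON object per line.
manifest = "\n".join(json.dumps(e) for e in entries)
print(manifest)
```

Writing such a file and passing its path as ``manifest_filepath`` (either in the YAML or on the command line) is all the dataset section needs to locate your audio and labels.
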

Any initialization parameters that are accepted by the Dataset class used in your experiment
can be set in the config file.


An example TitaNet train and validation configuration could look like the following (``{NEMO_ROOT}/examples/speaker_tasks/recognition/conf/titanet-large.yaml``):


.. code-block:: yaml

    model:
      train_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # finds labels based on manifest file
        batch_size: 32
        trim_silence: False
        shuffle: True

      validation_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # Keep None, to match the labels extracted during training
        batch_size: 32
        shuffle: False # No need to shuffle the validation data


If you would like to use a tarred dataset, have a look at `Datasets Configuration <../configs.html#dataset-configuration>`__.


Preprocessor Configuration
--------------------------

The preprocessor computes the MFCC or mel spectrogram features that are given as inputs to the model.
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__.

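
As a rough illustration of what the framing parameters of a preprocessor control, the following back-of-the-envelope sketch computes how many feature frames a clip produces. The ``window_size``/``window_stride`` values are assumptions mirroring common 25 ms / 10 ms framing; check your preprocessor config for the actual settings.

```python
# Back-of-the-envelope framing arithmetic for a spectrogram preprocessor.
# The window_size / window_stride values are assumed 25 ms / 10 ms framing;
# check your preprocessor config for the actual settings.
sample_rate = 16000    # Hz, matching the train_ds example above
window_size = 0.025    # seconds per analysis window
window_stride = 0.01   # seconds between successive windows

def num_frames(duration_s: float) -> int:
    """Number of full analysis windows that fit into a clip of this length."""
    samples = int(duration_s * sample_rate)
    win = int(window_size * sample_rate)     # 400 samples
    hop = int(window_stride * sample_rate)   # 160 samples
    if samples < win:
        return 0
    return 1 + (samples - win) // hop

print(num_frames(3.0))  # a 3-second utterance -> 298 frames
```
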

Augmentation Configurations
---------------------------

For TitaNet training, we use on-the-fly augmentation with MUSAN noise and RIR impulses via the ``noise`` augmentor section.

The following example sets up MUSAN augmentation with audio files taken from the given manifest path and the
minimum and maximum SNR specified with ``min_snr_db`` and ``max_snr_db`` respectively. This section can be added to
the ``train_ds`` part of the model config:


.. code-block:: yaml

    model:
      ...
      train_ds:
        ...
        augmentor:
          noise:
            manifest_path: /path/to/musan/manifest_file
            prob: 0.2 # probability of augmenting the incoming batch audio with augmentor data
            min_snr_db: 5
            max_snr_db: 15


See the :class:`nemo.collections.asr.parts.preprocessing.perturb.AudioAugmentor` API section for more details.

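
To illustrate what ``min_snr_db`` and ``max_snr_db`` control, here is a simplified sketch of SNR-based noise mixing. It shows only the underlying arithmetic, not NeMo's actual perturbation code:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Illustrative only -- not NeMo's implementation."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power satisfies: p_speech / p_noise_target = 10 ** (snr_db / 10)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# With prob: 0.2, a batch is augmented 20% of the time, at an SNR drawn
# uniformly from [min_snr_db, max_snr_db] = [5, 15]. Here we fix 10 dB.
mixed = mix_at_snr([0.1, -0.2, 0.3, -0.1], [0.05, 0.04, -0.03, 0.02], 10.0)
```

Lower SNR values mean louder noise relative to speech, i.e. harder augmentation.
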

Model Architecture Configurations
---------------------------------

Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field
specifying the module to use for each.

The following sections go into more detail about the specific configurations of each model architecture.

For more information about the TitaNet encoder models, see the :doc:`Models <./models>` page.


Decoder Configurations
----------------------

After features have been computed by the TitaNet encoder, we pass them to the decoder, which computes the embeddings
and then the log probabilities used for training.


.. code-block:: yaml

    model:
      ...
      decoder:
        _target_: nemo.collections.asr.modules.SpeakerDecoder
        feat_in: *enc_feat_out
        num_classes: 7205 # total number of classes in the voxceleb1,2 training manifest file
        pool_mode: attention # xvector, attention
        emb_sizes: 192 # size of intermediate emb layers; can be comma separated for additional layers, like 512,512
        angular: true # if true, the loss is changed to angular softmax, using scale and margin from the loss section; otherwise train with cross-entropy loss

      loss:
        scale: 30
        margin: 0.2
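
When ``angular: true`` is set, ``scale`` and ``margin`` enter the loss roughly as in an additive-angular-margin (ArcFace-style) softmax. The following sketch illustrates the general technique as an assumption, not NeMo's exact loss code:

```python
import math

def angular_logit(cos_theta, is_target, scale=30.0, margin=0.2):
    """ArcFace-style logit: add the angular margin to the target-class angle,
    then multiply by the scale. A sketch of the idea, not NeMo's exact code."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if is_target:
        theta += margin
    return scale * math.cos(theta)

# The margin shrinks the target-class logit, so the embedding must sit
# closer to its class centre to score as well as before.
same = angular_logit(0.8, is_target=True)    # ~19.95
other = angular_logit(0.8, is_target=False)  # plain scaled cosine: 24.0
```

The margined logits are then fed into an ordinary softmax cross-entropy, which is why only ``scale`` and ``margin`` appear in the ``loss`` section.
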