|
NeMo SSL Configuration Files |
|
============================ |
|
|
|
This page covers NeMo configuration file setup that is specific to models in the Speech Self-Supervised Pre-training collection. |
|
For general information about how to set up and run experiments that is common to all NeMo models (e.g. |
|
experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page. |
|
|
|
Dataset Configuration |
|
|
|
|
|
Dataset configuration for self-supervised model is mostly the same as for standard ASR training, |
|
covered `here <../configs.html |
|
we will need to mask an equivalent amount of patches for all utterances in a batch. This means that we want to avoid |
|
the durations varying too significantly within a single batch. There are several ways you can achieve this in NeMo: |
|
|
|
1) The simplest way is to use the ``min_duration`` parameter in the dataset config, which will simply |
|
discard all utterances below the specified length. This is a viable option if removing these utterances will not |
|
significantly impact the total amount of hours of your dataset. |
|
|
|
2) If your dataset contains many long utterances (longer than ~16 seconds) with varying length, then you may instead |
|
want to use the ``random_segment`` perturbation, which will sample segments of a certain length from the full sample at |
|
runtime (samples below the provided segment length will be padded). You can enable this by adding the following to your |
|
dataset config: |
|
|
|
.. code-block:: yaml |
|
|
|
augmentor: |
|
random_segment: |
|
prob: 1.0 |
|
duration_sec: 16 |
|
|
|
3) You can also use bucketing to ensure similar utterance lengths within batches. |
|
See `Bucketing documentation <../datasets.html |
|
|
|
An example of SSL train and validation configuration should look similar to the following: |
|
|
|
.. code-block:: yaml |
|
|
|
model: |
|
train_ds: |
|
manifest_filepath: ??? |
|
sample_rate: ${model.sample_rate} |
|
batch_size: 16 |
|
shuffle: true |
|
num_workers: 8 |
|
pin_memory: false |
|
use_start_end_token: true |
|
trim_silence: false |
|
max_duration: 16.7 |
|
min_duration: 8.0 |
|
|
|
is_tarred: false |
|
tarred_audio_filepaths: null |
|
shuffle_n: 2048 |
|
|
|
bucketing_strategy: "synced_randomized" |
|
bucketing_batch_size: null |
|
|
|
validation_ds: |
|
manifest_filepath: ??? |
|
sample_rate: ${model.sample_rate} |
|
batch_size: 16 |
|
shuffle: false |
|
num_workers: 8 |
|
pin_memory: true |
|
use_start_end_token: false |
|
min_duration: 8.0 |
|
|
|
|
|
Preprocessor Configuration |
|
|
|
|
|
Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. |
|
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html |
|
|
|
Augmentation Configurations |
|
|
|
|
|
For self-supervised pre-training, we recommend using the ``MaskedPatchAugmentation`` class for spectrogram masking. |
|
This augmentation divides utterances into fixed size patches, and then masks a fixed amount/fraction of them. You can |
|
also add ``freq_masks`` and ``freq_width`` to apply masking to frequency bands. |
|
|
|
If you are using contrastive loss with negatives sampled from masked steps in same utterance only, |
|
make sure that the total amount of masked steps in each utterance will be big enough for the number of sampled negatives. |
|
For example, if you are using 4x stride and want to sample 100 negatives, then you will need more than 400 masked steps. |
|
If you are using the default ``patch_size`` of 48, then this means you will need to set ``mask_patches`` to at least 9. |
|
When using a fraction of the total amount of patches instead of a fixed amount, you will need to make sure that the |
|
minimum duration of your samples in large enough for the number of negatives to sample. |
|
|
|
.. code-block:: yaml |
|
|
|
spec_augment: |
|
_target_: nemo.collections.asr.modules.MaskedPatchAugmentation |
|
patch_size: 48 |
|
mask_patches: 0.5 |
|
freq_masks: 3 |
|
freq_width: 20 |
|
|
|
|
|
Model Architecture Configurations |
|
|
|
|
|
Each configuration file should describe the model architecture being used for the experiment. For self-supervised pre-training, |
|
we will typically train the encoder of the model and then re-use it for fine-tuning, so the encoder can be configured in the same way |
|
as you would for an ASR model. Note that any ASR model encoder can be used with any of the available pre-training methods, |
|
though, given the same model sizes, we find the best downstream results when using `Conformer <./models.html |
|
|
|
Unlike the encoders, the decoders and corresponding losses will be specific to the self-supervised pre-training, and are small enough that |
|
you can discard them when transferring the model to fine-tuning. |
|
|
|
The most basic method of pre-training we can use is to have the model solve a contrastive task |
|
(this is the approach used in wav2vec 2.0 :cite:`ssl-models-wav2vec2`) |
|
We can define the corresponding decoder and loss configs in the following way for an encoder with stride 4x. |
|
|
|
.. code-block:: yaml |
|
|
|
decoder_out: 128 |
|
|
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.encoder.d_model} |
|
feat_hidden: 128 |
|
feat_out: ${model.decoder_out} |
|
stride_layers: 0 |
|
|
|
|
|
non_stride_layers: 0 |
|
|
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: ${model.preprocessor.features} |
|
proj_dim: ${model.decoder_out} |
|
combine_time_steps: 4 |
|
quantized_targets: true |
|
codebook_size: 300 |
|
num_groups: 2 |
|
num_negatives: 100 |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
|
|
Note that in the above example we combine 4 steps from the input spectrogram into a single "token" for the loss, |
|
which corresponds to the encoder stride 4x. We might want to use different values for "combine_time_steps" and encoder stride. |
|
In that case, we will need to add stride layers to decoders to match the strides. We can use the following example config |
|
for a Citrinet encoder with stride 8x. In order to go from stride 8x to 4x, we use a single ``stride_layer`` in the decoder |
|
with ``stride_transpose`` set to True. |
|
|
|
.. code-block:: yaml |
|
|
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.model_defaults.enc_final} |
|
feat_hidden: 128 |
|
feat_out: ${model.model_defaults.decoder_out_channels} |
|
stride_layers: 1 |
|
|
|
|
|
non_stride_layers: 0 |
|
stride_tranpose: true |
|
|
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: *n_mels |
|
proj_dim: ${model.model_defaults.decoder_out_channels} |
|
combine_time_steps: 4 |
|
quantized_targets: false |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
|
|
|
|
It can be beneficial to combine contrastive loss with other losses, such as a masked language modeling (mlm) loss |
|
(similar approach to W2V-Bert :cite:`ssl-models-w2v_bert`). |
|
In order to do this, instead of specifying a single ``decoder`` and ``loss`` in the config, we can specify a ``loss_list``, |
|
which can contain any amount of corresponding decoders and losses. For each decoder-loss pair, |
|
we can specify a separate named sub-config, which contains the following fields: |
|
|
|
1. ``decoder`` - The decoder config, specifying a ``target`` class and parameters. |
|
2. ``loss`` - The corresponding loss config, specifying a ``target`` class and parameters. |
|
3. ``loss_alpha`` - A multiplier on this loss (1.0 by default). |
|
4. ``targets_from_loss`` - This parameter specifies which contrastive loss we should extract labels from. It is necessary for any loss which requires labels, if labels aren't present in your manifest. |
|
5. ``transpose_encoded`` - This parameter is used to optionally transpose the encoded features before passing them into this loss. |
|
6. ``start_step`` - The training step at which we should start using this decoder+loss. |
|
7. ``output_from_layer`` - This parameter can be used to specify the name of the layer that we should extract encoded features from to pass into this decoder. If it's not specified or set to null, the final encoder layer is used. |
|
|
|
|
|
The following is an example of a `loss_list` for a combination of contrastive+mlm losses, |
|
where the mlm loss uses targets from the quantization module of the contrastive loss. |
|
|
|
|
|
.. code-block:: yaml |
|
|
|
decoder_out: 128 |
|
|
|
loss_list: |
|
contrastive: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.encoder.d_model} |
|
feat_hidden: 128 |
|
|
|
feat_out: ${model.decoder_out} |
|
stride_layers: 0 |
|
|
|
|
|
non_stride_layers: 0 |
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: ${model.preprocessor.features} |
|
proj_dim: ${model.decoder_out} |
|
combine_time_steps: 4 |
|
quantized_targets: true |
|
|
|
codebook_size: 300 |
|
num_groups: 2 |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
|
|
mlm: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoder |
|
feat_in: ${model.encoder.d_model} |
|
num_classes: 90000 |
|
|
|
loss: |
|
_target_: nemo.collections.asr.losses.MLMLoss |
|
combine_time_steps: 4 |
|
targets_from_loss: "contrastive" |
|
|
|
loss_alpha: 1000. |
|
|
|
transpose_encoded: false |
|
|
|
start_step: 0 |
|
|
|
|
|
|
|
output_from_layer: null |
|
|
|
|
|
|
|
|
|
We can also use other losses which require labels instead of mlm, such as ctc or rnnt loss. Since these losses, unlike mlm, |
|
don't require our targets to have a direct alignment with our steps, we may also want to use set the ``reduce_ids`` parameter of the |
|
contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurence of that id. |
|
|
|
An example of a ``loss_list`` consisting of contrastive+ctc loss can look like this: |
|
|
|
.. code-block:: yaml |
|
|
|
decoder_out: 128 |
|
|
|
loss_list: |
|
contr: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.encoder.d_model} |
|
feat_hidden: 128 |
|
feat_out: ${model.decoder_out} |
|
stride_layers: 0 |
|
non_stride_layers: 0 |
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: ${model.preprocessor.features} |
|
proj_dim: ${model.decoder_out} |
|
combine_time_steps: 4 |
|
quantized_targets: true |
|
codebook_size: 300 |
|
num_groups: 2 |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
reduce_ids: true |
|
|
|
ctc: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoder |
|
feat_in: ${model.encoder.d_model} |
|
num_classes: 90000 |
|
loss: |
|
_target_: nemo.collections.asr.losses.CTCLossForSSL |
|
num_classes: 90000 |
|
targets_from_loss: "contr" |
|
start_step: 3000 |
|
|
|
An example of contrastive+rnnt can look like this: |
|
|
|
.. code-block:: yaml |
|
|
|
decoder_out: 128 |
|
|
|
loss_list: |
|
contr: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.encoder.d_model} |
|
feat_hidden: 128 |
|
feat_out: ${model.decoder_out} |
|
stride_layers: 0 |
|
non_stride_layers: 0 |
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: ${model.preprocessor.features} |
|
proj_dim: ${model.decoder_out} |
|
combine_time_steps: 4 |
|
quantized_targets: true |
|
codebook_size: 24 |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
reduce_ids: true |
|
|
|
rnnt: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.RNNTDecoderJointSSL |
|
decoder: |
|
_target_: nemo.collections.asr.modules.RNNTDecoder |
|
normalization_mode: null |
|
random_state_sampling: false |
|
blank_as_pad: true |
|
vocab_size: 576 |
|
prednet: |
|
pred_hidden: 640 |
|
pred_rnn_layers: 1 |
|
t_max: null |
|
dropout: 0.1 |
|
joint: |
|
_target_: nemo.collections.asr.modules.RNNTJoint |
|
log_softmax: null |
|
preserve_memory: false |
|
experimental_fuse_loss_wer: false |
|
jointnet: |
|
encoder_hidden: 512 |
|
pred_hidden: 640 |
|
joint_hidden: 640 |
|
activation: "relu" |
|
dropout: 0.1 |
|
num_classes: 576 |
|
loss: |
|
_target_: nemo.collections.asr.losses.RNNTLossForSSL |
|
num_classes: 576 |
|
targets_from_loss: "contr" |
|
start_step: 1000 |
|
|
|
|
|
We can also use multiple losses, which use features from different intermediate layers of the encoder as input :cite:`ssl-models-ssl_inter`. |
|
In the following config example, we use contrastive loss + three different mlm losses, which use encoder outputs |
|
respectively from 6th, 12th and final layer. |
|
|
|
.. code-block:: yaml |
|
|
|
decoder_out: 128 |
|
|
|
loss_list: |
|
contr: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction |
|
feat_in: ${model.encoder.d_model} |
|
feat_hidden: 128 |
|
feat_out: ${model.decoder_out} |
|
stride_layers: 0 |
|
non_stride_layers: 0 |
|
loss: |
|
_target_: nemo.collections.asr.losses.ContrastiveLoss |
|
in_dim: ${model.preprocessor.features} |
|
proj_dim: ${model.decoder_out} |
|
combine_time_steps: 4 |
|
quantized_targets: true |
|
codebook_size: 300 |
|
sample_from_same_utterance_only: true |
|
sample_from_non_masked: false |
|
loss_alpha: 5. |
|
|
|
mlm: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoder |
|
feat_in: ${model.encoder.d_model} |
|
num_classes: 90000 |
|
loss: |
|
_target_: nemo.collections.asr.losses.MLMLoss |
|
combine_time_steps: 4 |
|
targets_from_loss: "contr" |
|
loss_alpha: 1000. |
|
|
|
mlm_2: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoder |
|
feat_in: ${model.encoder.d_model} |
|
num_classes: 90000 |
|
loss: |
|
_target_: nemo.collections.asr.losses.MLMLoss |
|
combine_time_steps: 4 |
|
targets_from_loss: "contr" |
|
loss_alpha: 300. |
|
output_from_layer: "layers.5" |
|
transpose_encoded: true |
|
|
|
mlm_3: |
|
decoder: |
|
_target_: nemo.collections.asr.modules.ConvASRDecoder |
|
feat_in: ${model.encoder.d_model} |
|
num_classes: 90000 |
|
loss: |
|
_target_: nemo.collections.asr.losses.MLMLoss |
|
combine_time_steps: 4 |
|
targets_from_loss: "contr" |
|
loss_alpha: 300. |
|
output_from_layer: "layers.11" |
|
transpose_encoded: true |
|
|
|
References |
|
|
|
|
|
.. bibliography:: ../asr_all.bib |
|
:style: plain |
|
:labelprefix: SSL-MODELS |
|
:keyprefix: ssl-models- |