Orange/Speaker-wavLM-pro

ggmbr committed on Feb 5

Commit 7afcc3b · 1 Parent(s): 96588a3

Update README.md

Files changed (1)

README.md +38 -2

@@ -20,7 +20,7 @@ The main difference with classical ASV embeddings is that here only the non-timb
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
-See section below for an eplanation on how to compute the non-timbral embeddings.
 # Publication
 Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
@@ -43,7 +43,43 @@ Gengembre, N., Le Blouch, O., Gendrot, C. (2024) Disentangling prosody and timbr
 ```
 # Usage
-code
 # Limitations
 The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect the performance on other languages.

 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
+See the [Usage section](https://huggingface.co/ggmbr/wnt#usage) below for an explanation of how to compute the non-timbral embeddings.
 # Publication
 Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
 ```
 # Usage
+```python
+import torch
+import torch.nn as nn
+from transformers.models.wavlm.modeling_wavlm import WavLMPreTrainedModel, WavLMModel
+class TopLayers(nn.Module):
+    def __init__(self, embd_size = 250, top_interm_size = 512):
+        super(TopLayers, self).__init__()
+        self.affine1 = nn.Conv1d(in_channels=2048, out_channels=top_interm_size, kernel_size=1)
+        self.batchnorm1 = nn.BatchNorm1d(num_features=top_interm_size, affine=False, eps=1e-03)
+        self.affine2 = nn.Conv1d(in_channels=top_interm_size, out_channels=embd_size, kernel_size=1)
+        self.batchnorm2 = nn.BatchNorm1d(num_features=embd_size, affine=False, eps=1e-03)
+        self.activation = nn.ReLU(inplace=True)
+    def forward(self, x):
+        out = self.batchnorm1(self.activation(self.affine1(x)))
+        out = self.batchnorm2(self.activation(self.affine2(out)))
+        return nn.functional.normalize(out[:,:,0])
+class EmbeddingsModel(WavLMPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.wavlm = WavLMModel(config)
+        self.top_layers = TopLayers(config.embd_size, config.top_interm_size)
+    def forward(self, input_values):
+        # MVN normalization
+        x_norm = (input_values - input_values.mean(dim=1).unsqueeze(1)) / (input_values.std(dim=1).unsqueeze(1))
+        base_out = self.wavlm(input_values=x_norm, output_hidden_states=False).last_hidden_state
+        v = base_out.var(dim=1).clamp(min=1e-10)
+        x_stats = torch.cat((base_out.mean(dim=1),v.pow(0.5)),dim=1).unsqueeze(dim=2)
+        return self.top_layers(x_stats)
+nt_extractor = EmbeddingsModel.from_pretrained("ggmbr/wnt")
+```
+# Evaluations
 # Limitations
 The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect the performance on other languages.
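
Note on the pooling step: in the `forward` method of `EmbeddingsModel` added by this commit, the frame-level WavLM outputs are reduced to one vector per utterance by concatenating the per-dimension mean and standard deviation over time. This is why `TopLayers` expects 2048 input channels, assuming the 1024-dimensional hidden states of WavLM-large. The following is a minimal plain-Python sketch of that pooling on toy data, not the model code itself; the function name `stats_pool` and the toy frames are hypothetical and only serve the illustration.

```python
import math
import statistics


def stats_pool(frames, var_floor=1e-10):
    """Concatenate per-dimension mean and standard deviation over time.

    frames: list of T frame vectors (T >= 2), each a list of D floats.
    Returns 2 * D floats: D means followed by D standard deviations.
    """
    # Transpose T x D frames into D sequences of length T.
    dims = list(zip(*frames))
    means = [statistics.fmean(d) for d in dims]
    # statistics.variance is the unbiased sample variance, which matches
    # the default behaviour of torch.Tensor.var used in the README code.
    # The floor mirrors clamp(min=1e-10) applied before the square root.
    stds = [math.sqrt(max(statistics.variance(d), var_floor)) for d in dims]
    return means + stds


# Toy input (hypothetical): 3 frames of dimension 2.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
pooled = stats_pool(frames)
print(len(pooled))  # 2 * D values
```

With D = 1024 the pooled vector has 2048 entries, matching `in_channels=2048` in `affine1`. The variance floor keeps the square root well defined for a dimension that is constant over time, as the second dimension is in the toy input.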