---
language:
- en
- de
- es
- it
- nl
- pt
- pl
- ro
- sv
- da
- fi
- hu
- el
- fr
- ru
- uk
- tr
- ar
- hi
- ja
- ko
- zh
- vi
- la
- ha
- sw
- yo
- wo
library: xvasynth
tags:
- emotion
- audio
- text-to-speech
- speech-to-speech
- voice conversion
- tts
pipeline_tag: text-to-speech
---
GitHub project, inference Windows/Electron app: https://github.com/DanRuta/xVA-Synth

Fine-tuning app: https://github.com/DanRuta/xva-trainer

This is the base model for training other [🤗 xVASynth](https://huggingface.co/spaces/Pendrokar/xVASynth-TTS) "xVAPitch" (v3) models. The model itself is used by the xVATrainer TTS model training app and is not intended for inference. All created by Dan ["@dr00392"](https://huggingface.co/dr00392) Ruta.
> The v3 model now uses a slightly custom tweaked VITS/YourTTS model. Tweaks including larger capacity, bigger lang embedding, custom symbol set (a custom spec of ARPAbet with some more phonemes to cover other languages), and I guess a different training script.
>
> — Dan Ruta
When used in the xVASynth editor, it produces an adult American male voice. The default pacing is too fast and has to be adjusted.
xVAPitch_5820651 model sample: <audio controls>
<source src="https://huggingface.co/Pendrokar/xvapitch/resolve/main/xVAPitch_5820651.wav?download=true" type="audio/wav">
Your browser does not support the audio element.
</audio>
There are hundreds of fine-tuned models on the web, but most of them were trained on non-permissive datasets.
## xVASynth Editor v3 walkthrough video ▶:
[xVASynth Editor v3 walkthrough](https://www.youtube.com/watch?v=5u4xpI-cAd8)
## xVATrainer v1 walkthrough video ▶:
[xVATrainer v1 walkthrough](https://www.youtube.com/watch?v=PXv_SeTWk2M)
Papers:
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - https://arxiv.org/abs/2106.06103
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone - https://arxiv.org/abs/2112.02418
Referenced papers within code:
- Multi-head attention with relative positional embedding - https://arxiv.org/pdf/1809.04281.pdf
- Transformer with relative positional encoding - https://arxiv.org/abs/1803.02155
- SDP (stochastic duration predictor) - https://arxiv.org/pdf/2106.06103.pdf
- Spline Flow - https://arxiv.org/abs/1906.04032
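To illustrate the "relative positional" idea referenced above, here is a minimal NumPy sketch of single-head self-attention with learned relative positional embeddings in the style of Shaw et al. (2018). This is an educational sketch only, not code from xVASynth; the function name, the clipping distance, and the shape of the embedding table are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_emb, max_dist):
    """Single-head attention with relative positional embeddings.

    q, k, v:  (T, d) query/key/value matrices for one sequence.
    rel_emb:  (2 * max_dist + 1, d) learned table, one vector per
              clipped relative offset j - i in [-max_dist, max_dist].
    """
    T, d = q.shape
    # Content-based scores, as in vanilla attention: (T, T)
    scores = q @ k.T
    # Relative-position scores: score[i, j] += q[i] . rel_emb[clip(j - i)]
    idx = np.arange(T)
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    scores = scores + np.einsum("id,ijd->ij", q, rel_emb[rel])
    # Scale and normalize, then mix the values
    weights = softmax(scores / np.sqrt(d), axis=-1)
    return weights @ v
```

Unlike absolute positional encodings added to the input, the bias here depends only on the (clipped) offset between positions, so the same table generalizes to sequence lengths unseen in training.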
Used datasets: unknown / non-permissive data