---
language:
- en
- de
- es
- it
- nl
- pt
- pl
- ro
- sv
- da
- fi
- hu
- el
- fr
- ru
- uk
- tr
- ar
- hi
- ja
- ko
- zh
- vi
- la
- ha
- sw
- yo
- wo
library_name: xvasynth
tags:
- emotion
- audio
- text-to-speech
- speech-to-speech
- voice conversion
- tts
pipeline_tag: text-to-speech
---
GitHub project, inference Windows/Electron app: https://github.com/DanRuta/xVA-Synth
Fine-tuning app: https://github.com/DanRuta/xva-trainer
The base model for training other [🤗 xVASynth](https://huggingface.co/spaces/Pendrokar/xVASynth-TTS) "xVAPitch" type (v3) models. The model itself is used by the xVATrainer TTS model training app and is not intended for inference. All created by Dan ["@dr00392"](https://huggingface.co/dr00392) Ruta.
> "The v3 model now uses a slightly custom tweaked VITS/YourTTS model. Tweaks including larger capacity, bigger lang embedding, custom symbol set (a custom spec of ARPAbet with some more phonemes to cover other languages), and I guess a different training script." (Dan Ruta)
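As a rough, non-authoritative illustration of what such a custom symbol set can look like, here is a minimal Python sketch of a phoneme-to-id table built from an ARPAbet core plus extra phonemes for non-English coverage. The symbol lists and names below are assumptions for illustration only; the actual xVAPitch symbol spec lives in the xVA-Synth/xVATrainer code.

```python
# Minimal sketch of a VITS-style symbol table extended beyond plain ARPAbet.
# The specific symbols here are illustrative assumptions, not xVAPitch's spec.
_PAD, _BOS, _EOS = "<pad>", "<bos>", "<eos>"
ARPABET = ["AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH"]  # (truncated)
EXTRA_PHONEMES = ["RR", "NY", "X"]  # e.g. trills/palatals for non-English languages

SYMBOLS = [_PAD, _BOS, _EOS] + ARPABET + EXTRA_PHONEMES
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def phonemes_to_ids(phonemes):
    """Map a phoneme sequence to integer ids for the text encoder."""
    return [SYMBOL_TO_ID[p] for p in phonemes]

# A "bigger lang embedding" simply means the per-language lookup table
# (e.g. torch.nn.Embedding(num_languages, lang_embed_dim)) uses a larger
# lang_embed_dim than stock YourTTS.
```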
When used in the xVASynth editor, the voice is that of an American adult male. The default pacing is too fast and has to be adjusted.
xVAPitch_5820651 model sample: <audio controls>
<source src="https://huggingface.co/Pendrokar/xvapitch/resolve/main/xVAPitch_5820651.wav?download=true" type="audio/wav">
Your browser does not support the audio element.
</audio>
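Files in this repository can also be fetched programmatically with the `huggingface_hub` client, for example the sample clip above. A minimal sketch; only the WAV filename is taken from this page, and the commented-out checkpoint name is a placeholder, not a real file:

```python
from huggingface_hub import hf_hub_download

# Download the audio sample referenced above (filename taken from this README).
wav_path = hf_hub_download(repo_id="Pendrokar/xvapitch",
                           filename="xVAPitch_5820651.wav")
print(wav_path)

# Hypothetical: fetching a base checkpoint for fine-tuning with xVATrainer
# would look the same; substitute a filename actually listed in this repo.
# ckpt_path = hf_hub_download(repo_id="Pendrokar/xvapitch",
#                             filename="<checkpoint>.pt")
```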
Hundreds of fine-tuned models can be found on the web, but most of them use non-permissive datasets.
## xVASynth Editor v3 walkthrough video ▶:
[![Video](https://img.youtube.com/vi/5u4xpI-cAd8/hqdefault.jpg)](https://www.youtube.com/watch?v=5u4xpI-cAd8)
## xVATrainer v1 walkthrough video ▶:
[![Video](https://img.youtube.com/vi/PXv_SeTWk2M/hqdefault.jpg)](https://www.youtube.com/watch?v=PXv_SeTWk2M)
## Papers:
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech - https://arxiv.org/abs/2106.06103
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone - https://arxiv.org/abs/2112.02418
## Referenced papers within code:
- Multi-head attention with relative positional embedding (sketched below) - https://arxiv.org/abs/1809.04281
- Transformer with relative positional encoding - https://arxiv.org/abs/1803.02155
- Stochastic Duration Predictor (SDP) - https://arxiv.org/abs/2106.06103
- Neural Spline Flows - https://arxiv.org/abs/1906.04032
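For readers who want a concrete picture of the first two references, below is a minimal, self-contained PyTorch sketch of multi-head self-attention with learned relative positional embeddings in the style of arXiv:1803.02155. It is an illustrative assumption, not the attention module actually used in xVAPitch:

```python
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    """Sketch of self-attention with relative positional embeddings
    (Shaw et al., arXiv:1803.02155). Illustrative only."""

    def __init__(self, dim, n_heads, max_rel_dist=16):
        super().__init__()
        self.h, self.d = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.max_rel = max_rel_dist
        # One learned embedding per clipped relative distance in [-max, +max].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, self.d)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.h, self.d).transpose(1, 2)  # (B, H, T, d)
        k = k.view(B, T, self.h, self.d).transpose(1, 2)
        v = v.view(B, T, self.h, self.d).transpose(1, 2)

        # Clipped relative distance between every query/key position pair.
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel)
        r = self.rel_emb(rel + self.max_rel)  # (T, T, d)

        # Content logits plus relative-position logits.
        logits = q @ k.transpose(-2, -1)                      # (B, H, T, T)
        logits = logits + torch.einsum("bhtd,tsd->bhts", q, r)
        attn = (logits / self.d ** 0.5).softmax(dim=-1)

        out = attn @ v                                        # (B, H, T, d)
        return self.out(out.transpose(1, 2).reshape(B, T, -1))
```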
Used datasets: unknown / non-permissive data