{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "UFjxB7vK2Prz",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 1. Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ysLvHZRO4uN2",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below lines\n",
"# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation\n",
"BRANCH = 'r1.17.0'\n",
"!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "O-HRFHBb_RDH",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,\n",
"# comment out the below lines and set NEMO_DIR to your local path.\n",
"NEMO_DIR = 'nemo'\n",
"!git clone https://github.com/NVIDIA/NeMo.git $NEMO_DIR"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sgqwl2ycC1Sh",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 2. Introduction to TTS"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UYHRrdrXHe28",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This notebook provides a high level overview of text-to-speech (TTS). It will cover high level concepts and discuss each component in a standard TTS pipeline, providing relevant examples and code snippets using [NeMo](https://github.com/NVIDIA/NeMo)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8meIFtgWHmxt",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 3. What is TTS?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6t8rK7L2HpZd",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Text-to-speech**, also known as **TTS** or **speech synthesis**, refers to a system by which a computer reads text aloud. Typically the synthesized audio resembles a realistic human voice.\n",
"\n",
"Most TTS models sound like the voice of the speaker whose audio it is trained on. Though some more recently developed algorithms have the potential to sound like real speakers they were not trained on, or sound like entirely new voices.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MC9oz1kiHqsW",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 4. The TTS pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ODp1OnA0SYdF",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Modern TTS systems are fairly complex, with an end to end pipeline consisting of several components that each require their own model or heuristics.\n",
"\n",
"A standard pipeline might look like:\n",
"\n",
"1. **Text Normalization**: Converting raw text to spoken text (eg. \"Mr.\" → \"mister\").
\n",
"2. **Grapheme to Phoneme conversion (G2P)**: Convert basic units of text (ie. graphemes/characters) to basic units of spoken language (ie. phonemes).\n",
"3. **Spectrogram Synthesis**: Convert text/phonemes into a spectrogram.\n",
"4. **Audio Synthesis**: Convert spectrogram into audio. Also known as **spectrogram inversion**. Models which do this are called **vocoders**.\n",
"\n",
"While this is the most common structure, there may be fewer or additional steps depending on the use case. For example, some languages do not require G2P and can instead rely on the model to convert raw text/graphemes to spectrogram.\n",
"\n",
"
Normalization Type | \n", "Input | \n", "Output | \n", "|
---|---|---|---|
Abbreviations | \n", "Mr. | \n", "mister | \n", " |
Acronyms | \n", "TTS | \n", "text to speech | \n", " |
Numbers | \n", "42 | \n", "forty two | \n", "|
Decimals | \n", "1.2 | \n", "one point two | \n", "|
Roman Numerals | \n", "VII | \n", "seventh | \n", "|
Cardinal Directions | \n", "N E S W | \n", "north east south west | \n", "|
URL | \n", "www.github.com | \n", "w w w dot github dot com | \n", "