---
license: apache-2.0
language:
- en
tags:
- Text-to-Speech
---
# Dia

[![Join us on Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289DA?logo=discord)](https://discord.gg/pgdB5YRe)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

<!-- Add other badges here: PyPI version, Hugging Face model, etc. -->

Dia is a 1.6B-parameter speech generation model created by Nari Labs. Dia generates highly realistic dialogue directly from a transcript, and you can condition the output on audio to control emotion and tone. The model can also produce nonverbal cues such as laughter, coughing, and throat clearing (see the sketch below).
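
As an illustration, a transcript might mark up speakers and nonverbal cues along these lines; the `[S1]`/`[S2]` tags and `(laughs)` cue shown here are assumptions for illustration only, so check the model card or demo page for the exact syntax the checkpoint expects:

```python
# Hypothetical transcript format -- the speaker tags and the (laughs) cue are
# assumptions for illustration; consult the model card for the real syntax.
INPUT_TEXT = (
    "[S1] Did you hear the announcement? (laughs) "
    "[S2] No, tell me everything."
)
```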

See the [Demo Page](https://yummy-fir-7a4.notion.site/dia-demo) for a comparison of our model with [ElevenLabs Studio](https://elevenlabs.io/studio) and [Sesame CSM-1B](https://github.com/SesameAILabs/csm).

## Features

- **Python Library:** Core functionality accessible via `import nari_tts`.
- **Command-Line Interface (CLI):** Generate audio from text using `scripts/infer.py`.
- **Gradio Web UI:** Interactive demo interface via `app/app.py`.
- **Hugging Face Hub Integration:** Load models directly from the Hub (`buttercrab/nari-tts` placeholder).
- **Docker Support:** Run the CLI or Gradio app in isolated containers.
  - `Dockerfile.cli`
  - `Dockerfile.app`
  - `docker-compose.yml` (for the Gradio app)
- **Audio Prompting:** Guide speech style using an optional audio prompt.

## Installation

### Prerequisites

- Python >= 3.10
- PyTorch >= 2.6.0 (check compatibility with your CUDA version if applicable; see the check below)
- `uv` (optional, for faster dependency installation: `pip install uv`)
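
To verify your environment before installing, a quick check like the following can help (a minimal sketch using only the standard library and PyTorch):

```python
# Minimal environment check: confirms the Python version meets the
# prerequisite above and reports the PyTorch version and CUDA availability.
import sys
import torch

assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```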

### Installing the Library

**1. From Source (Recommended for Development):**

Clone the repository and install in editable mode:

```bash
# Clone the repository (replace with your actual repo URL)
git clone https://github.com/your-username/nari-dialogue.git
cd nari-dialogue

# Install using pip (uses pyproject.toml)
pip install -e .

# Or install using uv (faster)
uv pip install -e .
```

**2. From PyPI (Once Published):**

```bash
pip install nari-tts
```

**3. Installing Dependencies Separately:**

You can install the dependencies with `uv` or `pip` using the provided requirements file:

```bash
# Using uv (recommended)
uv pip install -r requirements.txt

# Using pip
pip install -r requirements.txt
```

## Usage

### As a Python Library

```python
import torch
import dac
import soundfile as sf
from nari_tts import load_model_from_hub, generate, codebook_to_audio

# --- Configuration ---
REPO_ID = "buttercrab/nari-tts"  # Replace with actual Hub ID
DAC_MODEL_TYPE = "44khz"
OUTPUT_FILENAME = "library_output.wav"
INPUT_TEXT = "This audio was generated using the Nari TTS library."

# --- Device ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Load Models ---
print(f"Loading Nari model from {REPO_ID}...")
model, config = load_model_from_hub(REPO_ID, device=device)

print(f"Loading DAC model ({DAC_MODEL_TYPE})...")
dac_model_path = dac.utils.download(model_type=DAC_MODEL_TYPE)
dac_model = dac.DAC.load(dac_model_path).to(device)
dac_model.eval()
dac_sample_rate = dac_model.sample_rate

# --- Generate ---
print(f"Generating audio for: '{INPUT_TEXT}'")
generated_codes = generate(
    model=model,
    config=config,
    text=INPUT_TEXT,
    max_tokens=config.data.audio_length,  # Or specify a number
    cfg_scale=3.0,
    temperature=1.0,
    top_p=0.95,
    use_cfg_filter=True,
    device=device,
    cfg_filter_top_k=50,
    # audio_prompt_path="path/to/prompt.wav",  # Optional
    dac_model=dac_model,
)

# --- Decode and Save ---
if generated_codes.numel() > 0:
    print("Converting codes to audio...")
    audio_tensor = codebook_to_audio(
        generated_codes=generated_codes.transpose(0, 1),  # Needs [C, T]
        dac_model=dac_model,
        delay_pattern=config.data.delay_pattern,
        C=config.data.channels,
    )
    audio_np = audio_tensor.cpu().float().numpy().squeeze()
    sf.write(OUTPUT_FILENAME, audio_np, dac_sample_rate)
    print(f"Audio saved to {OUTPUT_FILENAME}")
else:
    print("Generation failed.")
```

### Command-Line Interface (CLI)

The CLI script `scripts/infer.py` allows generation from the terminal.

**Basic Usage (Loading from Hub):**

```bash
python scripts/infer.py "Your input text goes here." \
    --repo-id buttercrab/nari-tts \
    --output generated_speech.wav
```

**Loading from Local Files:**

```bash
python scripts/infer.py "Text for local model." \
    --local-paths \
    --config path/to/your/config.json \
    --checkpoint path/to/your/nari_v0.pth \
    --output local_output.wav
```

**With Audio Prompt:**

```bash
python scripts/infer.py "Generate speech like this prompt." \
    --repo-id buttercrab/nari-tts \
    --audio-prompt path/to/your/prompt.wav \
    --output prompted_output.wav
```

**See all options:**

```bash
python scripts/infer.py --help
```

### Gradio Web UI

The Gradio app provides an interactive interface.

**Running Locally:**

```bash
# Load from Hub (replace repo-id)
python app/app.py --repo-id buttercrab/nari-tts

# Load from local files
python app/app.py --local-paths --config path/to/config.json --checkpoint path/to/checkpoint.pth
```

Access the UI in your browser (usually at `http://127.0.0.1:7860`).

### Docker

**1. Build the Images:**

```bash
# Build CLI image
docker build -t nari-tts-cli:latest -f Dockerfile.cli .

# Build App image
docker build -t nari-tts-app:latest -f Dockerfile.app .
```

**2. Run CLI Container:**

```bash
docker run --rm -v "$(pwd)/output:/app/output" nari-tts-cli:latest \
    "Generating audio inside a Docker container." \
    --repo-id buttercrab/nari-tts \
    --output /app/output/docker_output.wav

# Mount ~/.cache/huggingface to reuse downloads (use $HOME rather than ~,
# since tilde expansion does not happen inside quoted strings):
# docker run --rm -v "$(pwd)/output:/app/output" -v "$HOME/.cache/huggingface:/root/.cache/huggingface" nari-tts-cli:latest ...
```

_(Note: Adjust volume mounts as needed for your OS and paths.)_

**3. Run Gradio App Container (using Docker Compose):**

This is the easiest way to run the Gradio app, since `docker-compose.yml` handles the port mapping and volume mounts for you.

```bash
docker compose up
```

This builds the `nari-tts-app` image if it doesn't already exist and starts the container.
Access the UI at `http://localhost:7860`.

To stop the service:

```bash
docker compose down
```

## Model

_(Add details about the model architecture, training data, expected quality, and a link to the Hugging Face model card here.)_

## Disclaimer

This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:

- **Identity Misuse**: Do not produce audio resembling real individuals without permission.
- **Deceptive Content**: Do not use this model to generate misleading content (e.g., fake news).
- **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical use of this technology.

## License

This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details.

## TODO / Future Work

- Optimize inference speed.
- Add quantization for memory efficiency.

## Contributing

We are a tiny team of one full-time and one part-time research engineer, and contributions are extremely welcome!
Join our [Discord Server](https://discord.gg/pgdB5YRe) for discussions.

## Acknowledgements

- We thank the [Google TPU Research Cloud program](https://sites.research.google/trc/about/) for providing computation resources.
- Our work was heavily inspired by [SoundStorm](https://arxiv.org/abs/2305.09636) and [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/).
- "Nari" is a pure Korean word for lily.