---
library_name: transformers
tags:
- unsloth
- text-to-audio
- s2s
license: cc-by-sa-4.0
datasets:
- KandirResearch/Speech2Speech
language:
- en
base_model:
- OuteAI/OuteTTS-0.3-500M
pipeline_tag: text-to-audio
---

# CiSiMi: A Text-to-Speech (TTS) Model

[![Buy Me A Coffee](https://img.shields.io/badge/Ko--fi-Support%20My%20Work-FF5E5B?style=for-the-badge&logo=ko-fi&logoColor=white)](https://ko-fi.com/lyte)
[![Dataset](https://img.shields.io/badge/Dataset-KandirResearch/Speech2Speech-blue)](https://huggingface.co/datasets/KandirResearch/Speech2Speech)
[![Model](https://img.shields.io/badge/Model-KandirResearch/CiSiMi--v0.1-green)](https://huggingface.co/KandirResearch/CiSiMi-v0.1)
[![Demo](https://img.shields.io/badge/Demo-KandirResearch/CiSiMi--At--Home-orange)](https://huggingface.co/spaces/KandirResearch/CiSiMi-At-Home)

## Overview

CiSiMi is an early prototype of a text-to-audio model that processes text inputs and responds with both text and audio. Built for resource-constrained environments, it is designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.

*"Being GPU poor and slightly disappointed with the csm release and my inability to run it, and having to wait for the time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom, and Mom gave me CiSiMi At Home!"*

This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.

## Technical Details

### Model Specifications

- **Architecture**: Based on OuteTTS-0.3-500M
- **Language**: English
- **Pipeline**: Text-to-audio
- **Parameters**: 500M
- **Training Dataset Size**: ~15k samples
- **Future Goals**: Scale to a 200k-500k sample dataset with multi-turn conversation, train both 500M and 1B parameter model variants, and add streaming for real-time use.

### Training Methodology

1. **Dataset Preparation**:
   - Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct)
   - Cleaned by removing code, mathematical expressions, and non-English content
   - Filtered to keep only entries with input+output texts of 256 tokens or less (see the sketch after this list)

2. **Audio Generation**:
   - Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
   - Verified each audio generation using [OpenAI Whisper](https://github.com/openai/whisper)
   - Published the resulting dataset as [KandirResearch/Speech2Speech](https://huggingface.co/datasets/KandirResearch/Speech2Speech)

3. **Model Training**:
   - Preprocessed dataset using modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md))
   - Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT
   - Trained for 6 epochs reaching a loss of 2.27 as a proof of concept
   - ~~Trained for 3 epochs reaching a loss of 2.42 as a proof of concept~~
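The 256-token length filter from step 1 can be reproduced with the `datasets` library. The snippet below is only a minimal sketch of that step, not the exact preprocessing script: the `input`/`output` column names, the `train` split, and the choice of the base model's tokenizer for counting are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer of the base model, used here only to count tokens for the length filter.
tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")

# Assumed split name; the source dataset may be organized differently.
dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def within_budget(example):
    # Keep only entries whose combined input+output text is 256 tokens or less.
    # NOTE: "input" and "output" are assumed column names, used for illustration.
    text = f"{example['input']} {example['output']}"
    return len(tokenizer(text)["input_ids"]) <= 256

filtered = dataset.filter(within_budget)
print(f"Kept {len(filtered)} of {len(dataset)} examples")
```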
## Usage Guide

### Sample

```
Explain to me how gravity works!
```

### Installation

```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```

### Implementation

```python
import sys  # needed for the ipykernel check during playback
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate response function
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt,
        temperature=0.6,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=None
    )
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)

    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"

        extracted_text = extract_text_from_tts_output(first_part)

        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")

        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                return extracted_text, (audio_output.sr, audio_numpy)

    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```
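If you want to keep the generated speech rather than only play it back, the sketch below writes the `(sample_rate, waveform)` tuple returned by `generate_response` to a 16-bit WAV file using only the standard library. It assumes the waveform is mono float audio in the [-1, 1] range; the file name and the conversion are illustrative, not part of the model's API.

```python
import wave
import numpy as np

def save_wav(path, sample_rate, audio):
    """Write mono float audio in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 16-bit samples
        wav_file.setframerate(sample_rate)  # e.g. the codec's sample rate
        wav_file.writeframes(pcm.tobytes())

if response_audio is not None:
    save_wav("cisimi_output.wav", response_audio[0], response_audio[1])
```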
## Limitations & Future Work

This early prototype has several areas for improvement:

- Limited training data (~15k samples)
- Basic prompt/chat template structure
- Opportunity to optimize training hyperparameters
- Potential for multi-turn conversation capabilities

**Potential Limitation**: This type of model fills up the context window quickly, which makes smaller models generally more practical to deploy.

## Acknowledgments & Citations

This model builds on the following open-source projects:

1. [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
2. [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
3. [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
4. [OpenAI Whisper](https://github.com/openai/whisper) - Speech verification
5. [Unsloth](https://github.com/unslothai/unsloth) - Training optimization