BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Abstract
The BatonVoice framework decouples instruction understanding from speech generation: an LLM creates vocal-feature plans and a specialized TTS model produces the speech, achieving strong performance in controllable and emotional speech synthesis with zero-shot cross-lingual generalization.
The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders their ability to follow text instructions for controllable Text-to-Speech~(TTS). To address this, we propose a new paradigm inspired by ``operationalism'' that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework in which an LLM acts as a ``conductor'', understanding user instructions and generating a textual ``plan'' -- explicit vocal features (e.g., pitch, energy). A separate TTS model, the ``orchestra'', then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
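The abstract describes a two-stage pipeline: a "conductor" LLM turns an instruction into a textual plan of explicit vocal features, and an "orchestra" TTS model renders speech from that plan. A minimal sketch of this decoupling is shown below; the function names, feature-plan schema, and values are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the BatonVoice conductor/orchestra decoupling.
# All names and the plan schema are assumptions for illustration only.

def conductor_plan(instruction: str, text: str) -> list[dict]:
    """Stage 1 ("conductor"): an LLM maps a user instruction to a textual
    plan of explicit vocal features per text segment. A real system would
    prompt an LLM here; we return a hand-written plan instead."""
    return [
        {"segment": text, "pitch": "high", "energy": "strong", "speed": 1.1},
    ]

def orchestra_synthesize(plan: list[dict]) -> bytes:
    """Stage 2 ("orchestra"): a feature-conditioned TTS model (BatonTTS in
    the paper) generates speech from the plan. Stubbed out here: we just
    serialize the plan as placeholder "audio" bytes."""
    return repr(plan).encode()

def baton_voice(instruction: str, text: str) -> bytes:
    plan = conductor_plan(instruction, text)  # instruction understanding
    return orchestra_synthesize(plan)         # speech generation

audio = baton_voice("Sound excited!", "We won the grand prize.")
```

Because the interface between the two stages is plain text, the conductor LLM can in principle be swapped or applied to languages unseen in post-training, which is the basis of the cross-lingual generalization claim.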
Community
Related papers recommended by the Semantic Scholar API:
- XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation (2025)
- UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice (2025)
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement (2025)
- Enhancing Speech Large Language Models through Reinforced Behavior Alignment (2025)
- ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark (2025)
- FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot (2025)
- MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech (2025)
Models citing this paper 1
Datasets citing this paper 0