Abstract
MGM-Omni is a unified multimodal language model for speech generation and understanding, featuring a dual-track architecture for efficient cross-modal interaction and data-efficient training.
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the token-rate gap between text and speech, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalised long-horizon speech generation.
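To make the chunk-based parallel decoding idea concrete, below is a minimal illustrative sketch (not the authors' implementation): the speech decoder emits K codec tokens per "brain" step, which narrows the rate gap between slow text tokens and much faster speech tokens and allows chunks to be streamed to a vocoder immediately. All module names, dimensions, and the value of K are illustrative assumptions.

```python
# Illustrative sketch of chunk-based parallel speech-token decoding.
# Hypothetical sizes and modules; not MGM-Omni's actual code.
import torch
import torch.nn as nn

HIDDEN, SPEECH_VOCAB, K = 512, 4096, 4  # K speech tokens per chunk (assumed)

class ChunkSpeechHead(nn.Module):
    """Toy speech-token head that predicts K codec tokens in parallel per step."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, K * SPEECH_VOCAB)

    def forward(self, brain_state):           # (batch, HIDDEN)
        logits = self.proj(brain_state)        # (batch, K * SPEECH_VOCAB)
        logits = logits.view(-1, K, SPEECH_VOCAB)
        return logits.argmax(dim=-1)           # (batch, K) speech tokens per chunk

# Streaming loop: every "brain" step yields one chunk of K speech tokens,
# so speech is produced K times faster than the text-token rate.
head = ChunkSpeechHead()
speech_tokens = []
for step in range(8):                          # stand-in for LLM decoding steps
    brain_state = torch.randn(1, HIDDEN)       # stand-in for the LLM hidden state
    speech_tokens.append(head(brain_state))
stream = torch.cat(speech_tokens, dim=1)       # (1, 8 * K) speech tokens so far
print(stream.shape)
```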
Community
MGM-Omni supports omni-modal understanding, speech understanding for audio over 60 minutes long, speech generation over 10 minutes long, and zero-shot voice cloning. Moreover, we release a benchmark for evaluating long-form and complex TTS.
Code: https://github.com/dvlab-research/MGM-Omni
Demo: https://huggingface.co/spaces/wcy1122/MGM-Omni
Model: https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1
Benchmark: https://huggingface.co/datasets/wcy1122/Long-TTS-Eval