arXiv:2509.25131

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

Published on Sep 29 · Submitted by Wang Chengyao on Sep 30

Abstract

AI-generated summary: MGM-Omni is a unified multimodal language model for speech generation and understanding, featuring a dual-track architecture for efficient cross-modal interaction and data-efficient training.

We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalized long-horizon speech generation.
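
To make the generation-side idea concrete, below is a minimal, self-contained sketch of chunk-based parallel decoding. Every name and number in it (predict_chunk, CHUNK_SIZE, the stopping rule) is a hypothetical stand-in rather than MGM-Omni's actual decoder; the sketch only illustrates the core idea that each decoding step emits a whole chunk of speech tokens, which can be streamed out immediately and narrows the gap between the fast speech-token rate and the slower text-token rate.

```python
# Illustrative sketch of chunk-based parallel decoding: instead of emitting one
# speech token per forward pass, the decoder predicts a whole chunk per step.
# Everything below is a toy stand-in, not MGM-Omni's actual decoder.

CHUNK_SIZE = 4    # speech tokens emitted per decoding step (hypothetical value)
EOS_SPEECH = -1   # sentinel end-of-speech token for this toy example

def predict_chunk(context):
    """Stand-in for the speech decoder: one call yields CHUNK_SIZE tokens.
    A real model would condition on the multimodal context and voice prompt."""
    start = len(context)
    if start >= 16:                      # toy stopping condition
        return [EOS_SPEECH]
    return list(range(start, start + CHUNK_SIZE))

def generate_speech_tokens(max_steps=100):
    tokens = []
    for _ in range(max_steps):
        chunk = predict_chunk(tokens)    # one forward pass, CHUNK_SIZE tokens
        tokens.extend(chunk)
        yield chunk                      # stream each chunk out as it is ready
        if chunk[-1] == EOS_SPEECH:
            break

for chunk in generate_speech_tokens():
    print("streamed chunk:", chunk)      # chunks can be vocoded as they arrive
```

Since speech tokenizers typically emit many more tokens per second than a text decoder produces, predicting several speech tokens per forward pass keeps the speech track from becoming the latency bottleneck during streaming.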

Community

Paper author · Paper submitter • edited 1 day ago

MGM-Omni supports omni-modal understanding, speech understanding for audio over 60 minutes long, speech generation over 10 minutes long, and zero-shot voice cloning. Moreover, we release a benchmark for evaluating long-form and complex TTS.
🌟 Code: https://github.com/dvlab-research/MGM-Omni
🎮 Demo: https://huggingface.co/spaces/wcy1122/MGM-Omni
🤖 Model: https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1
📊 Benchmark: https://huggingface.co/datasets/wcy1122/Long-TTS-Eval
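
To try the Long-TTS-Eval benchmark locally, here is a minimal sketch, assuming the dataset loads with the standard Hugging Face `datasets` API. The repo id comes from the benchmark link above, but the available splits and columns are not specified here, so inspect the loaded object before relying on any field names:

```python
from datasets import load_dataset

# Repo id taken from the benchmark link above. The split and column layout is
# an assumption, so print the dataset to see what it actually contains.
ds = load_dataset("wcy1122/Long-TTS-Eval")
print(ds)  # shows the splits and features the benchmark actually provides
```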

