Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
Abstract
Dolphin, an efficient AVSS method, uses a dual-path lightweight video encoder and a lightweight encoder-decoder separator with global-local attention blocks to achieve high separation quality and significant computational efficiency.
Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than a 2.4x reduction in MACs, and over 6x faster GPU inference. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
Community
🐬 Dolphin: Efficient Audio-Visual Speech Separation
Author's Introduction
Hi everyone! We're excited to share Dolphin - our work on making audio-visual speech separation actually practical for real-world deployment.
🎯 What We Built
Dolphin separates target speech from noisy audio by leveraging the speaker's lip movements. The key innovation: achieving SOTA quality while being over 6× faster and using over 50% fewer parameters than the previous SOTA method (IIANet).
Two main contributions:
DP-LipCoder: A lightweight video encoder that uses vector quantization to extract discrete lip semantics. We distill knowledge from AV-HuBERT while keeping the model compact (a minimal VQ-plus-distillation sketch follows this list).
Global-Local Attention: Multi-scale attention blocks that capture both long-range context (global) and fine-grained detail (local heat diffusion) in a single pass - no iterative refinement needed! (See the attention sketch after this list.)
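To make these concrete, here is a minimal PyTorch sketch of the first idea: a vector-quantized lip encoder whose discrete tokens are distilled toward a frozen teacher embedding (AV-HuBERT-style). The module names, sizes, and loss weights are illustrative assumptions, not the actual DP-LipCoder implementation.

```python
# Sketch only (not the DP-LipCoder implementation): a vector-quantized lip
# encoder whose discrete tokens are pulled toward frozen teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LipVQEncoder(nn.Module):
    def __init__(self, in_dim=512, code_dim=256, codebook_size=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, code_dim), nn.ReLU(), nn.Linear(code_dim, code_dim)
        )
        # Learnable codebook of discrete lip-semantic tokens.
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, lip_feats):                      # lip_feats: (B, T, in_dim)
        z = self.proj(lip_feats)                       # continuous lip embedding
        # Squared distances to every codeword, then nearest-codeword lookup.
        dists = (
            z.pow(2).sum(-1, keepdim=True)
            - 2 * z @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(-1)
        )                                              # (B, T, K)
        idx = dists.argmin(dim=-1)                     # discrete token ids (B, T)
        z_q = self.codebook(idx)                       # quantized embedding
        z_st = z + (z_q - z).detach()                  # straight-through estimator
        return z_st, z_q, z, idx


def vq_distill_loss(z_st, z_q, z, teacher_feats, beta=0.25):
    align = F.mse_loss(z_st, teacher_feats)    # align tokens with teacher semantics
    codebook = F.mse_loss(z_q, z.detach())     # pull codewords toward encoder outputs
    commit = F.mse_loss(z, z_q.detach())       # keep encoder close to chosen codewords
    return align + codebook + beta * commit


enc = LipVQEncoder()
lip = torch.randn(2, 50, 512)                  # 2 clips, 50 lip-ROI frame features
teacher = torch.randn(2, 50, 256)              # placeholder for frozen teacher features
z_st, z_q, z, idx = enc(lip)
loss = vq_distill_loss(z_st, z_q, z, teacher)
```

And a rough sketch of the second idea: a block that fuses a coarse (pooled) multi-head attention branch for long-range context with a depthwise temporal convolution branch for fine-grained local detail in a single residual pass. Again, the layer choices and sizes are assumptions rather than the paper's exact GLA design.

```python
# Sketch of a global-local attention block (illustrative, not the paper's GLA).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    def __init__(self, dim=256, heads=8, pool=4, local_kernel=5):
        super().__init__()
        self.pool = pool
        self.norm = nn.LayerNorm(dim)
        # Global branch: attention against a temporally pooled (coarse) sequence.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Local branch: depthwise temporal convolution for fine-grained detail.
        self.local = nn.Conv1d(dim, dim, local_kernel,
                               padding=local_kernel // 2, groups=dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                  # x: (B, T, C)
        y = self.norm(x)
        coarse = F.avg_pool1d(y.transpose(1, 2), self.pool).transpose(1, 2)
        g, _ = self.global_attn(y, coarse, coarse)         # long-range context
        l = self.local(y.transpose(1, 2)).transpose(1, 2)  # local detail
        return x + self.fuse(torch.cat([g, l], dim=-1))    # residual fusion


block = GlobalLocalAttention()
out = block(torch.randn(2, 50, 256))                       # -> (2, 50, 256)
```

Because the global branch attends over a pooled sequence, its cost scales with T·(T/pool) rather than T², which is where the single-pass efficiency comes from in this sketch.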
Results Snapshot
On VoxCeleb2:
- ✅ 16.1 dB SI-SNRi (vs IIANet's 15.8 dB) - see the SI-SNR sketch after this list
- ✅ 51M params (vs 112M) - 54% reduction
- ✅ 417G MACs (vs 1009G) - 59% less computation
- ✅ 0.015s inference (vs 0.100s) - 6.8× speedup
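For reference, SI-SNRi above is the improvement in scale-invariant SNR over the unprocessed mixture. A standard way to compute it (this is the common metric definition, not code from the paper) looks like:

```python
import torch


def si_snr(est, ref, eps=1e-8):
    # Zero-mean both signals, project the estimate onto the reference,
    # and compare the scaled target to the residual (in dB).
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)


def si_snr_improvement(est, ref, mix):
    # SI-SNRi = SI-SNR of the separated estimate minus SI-SNR of the raw mixture.
    return si_snr(est, ref) - si_snr(mix, ref)


mix = torch.randn(1, 16000)               # toy 1-second, 16 kHz signals
ref = torch.randn(1, 16000)
est = ref + 0.1 * torch.randn(1, 16000)
print(si_snr_improvement(est, ref, mix))  # positive => better than the mixture
```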
Try It Out
- 🤗 Demo: Hugging Face Space
- 💻 Code: GitHub
- 📦 Models: Hugging Face Hub
Thanks to the HF team for featuring our work! Feel free to ask questions - we're here to discuss.
Paper: arXiv:2509.23610
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation (2025)
- Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion (2025)
- AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition (2025)
- Audio-Visual Feature Synchronization for Robust Speech Enhancement in Hearing Aids (2025)
- Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation (2025)
- PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos (2025)
- LORT: Locally Refined Convolution and Taylor Transformer for Monaural Speech Enhancement (2025)