arXiv:2509.23610

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Published on Sep 28 · Submitted by Kai Li on Oct 1

Abstract

AI-generated summary: Dolphin, an efficient AVSS method, uses a dual-path lightweight video encoder and a lightweight encoder-decoder separator with global-local attention blocks to achieve high separation quality and significant computational efficiency.

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
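
To make the "discrete audio-aligned semantic tokens" idea concrete, here is a minimal vector-quantization sketch in PyTorch. It is not the authors' DP-LipCoder code; the class name, codebook size, and feature dimension are illustrative assumptions, and the distillation toward AV-HuBERT described in the paper is omitted.

```python
# Illustrative vector-quantization step, NOT the authors' DP-LipCoder code.
# Codebook size (1024) and feature dimension (256) are assumptions for this sketch.
import torch
import torch.nn as nn


class LipTokenQuantizer(nn.Module):
    """Map continuous lip-motion features to discrete codebook tokens (VQ-VAE style)."""

    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, dim) continuous visual features from a lip encoder
        flat = feats.reshape(-1, feats.size(-1))               # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)        # (B*T, num_codes)
        tokens = dists.argmin(dim=-1).view(feats.shape[:-1])   # discrete lip tokens (B, T)
        quantized = self.codebook(tokens)                      # (B, T, dim)
        # Straight-through estimator so gradients still reach the encoder
        quantized = feats + (quantized - feats).detach()
        return tokens, quantized


# Example: a 2-second clip at 25 fps -> 50 video frames
tokens, q = LipTokenQuantizer()(torch.randn(1, 50, 256))
print(tokens.shape, q.shape)  # torch.Size([1, 50]) torch.Size([1, 50, 256])
```

In the actual DP-LipCoder these tokens are additionally trained to align with AV-HuBERT-derived audio semantics; only the quantization step is sketched here.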

Community

Paper submitter

🐬 Dolphin: Efficient Audio-Visual Speech Separation

Author's Introduction

Hi everyone! 👋 We're excited to share Dolphin - our work on making audio-visual speech separation actually practical for real-world deployment.

🎯 What We Built

Dolphin separates target speech from noisy audio by leveraging lip movements. The key innovation: achieving SOTA quality while being 6× faster and using 50% fewer parameters than previous methods.

Two main contributions:

  1. DP-LipCoder: A lightweight video encoder using vector quantization to extract discrete lip semantics. We distill knowledge from AV-HuBERT while keeping the model compact.

  2. Global-Local Attention: Multi-scale attention blocks that capture both long-range context (a global attention branch) and fine-grained detail (a local heat-diffusion branch) in a single pass - no iterative refinement needed (see the sketch below).
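
To give a rough sense of how a global-local block can be wired, here is a conceptual PyTorch sketch. It is not the paper's GLA implementation: the global branch is plain multi-head self-attention, and the local branch is approximated with a depthwise 1-D convolution rather than the paper's heat-diffusion operator.

```python
# Conceptual global-local attention block, assuming PyTorch.
# NOT the paper's exact GLA block: the local branch below is a simple depthwise
# convolution standing in for the paper's local heat-diffusion operator.
import torch
import torch.nn as nn


class GlobalLocalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, local_kernel: int = 5):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv1d(dim, dim, local_kernel, padding=local_kernel // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) audio feature sequence inside the separator
        h = self.norm(x)
        g, _ = self.global_attn(h, h, h)                   # long-range (global) context
        l = self.local(h.transpose(1, 2)).transpose(1, 2)  # fine-grained (local) detail
        return x + g + l                                   # fused in a single pass


print(GlobalLocalBlock()(torch.randn(2, 200, 256)).shape)  # torch.Size([2, 200, 256])
```

In Dolphin such blocks sit at every layer of a multi-scale encoder-decoder separator; the sketch shows a single scale only.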

📊 Results Snapshot

On VoxCeleb2:

  • ✅ 16.1 dB SI-SNRi (vs IIANet's 15.8 dB)
  • ✅ 51M params (vs 112M) - 54% reduction
  • ✅ 417G MACs (vs 1009G) - 59% less computation
  • ✅ 0.015s inference (vs 0.100s) - 6.8× speedup
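
For readers unfamiliar with the metric, SI-SNRi is the standard scale-invariant SNR improvement; here is a minimal sketch of its definition (not the paper's evaluation code), assuming 1-D waveform tensors:

```python
# Standard SI-SNR / SI-SNRi definition (not code from the paper); tensors are 1-D waveforms.
import torch


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    est, ref = est - est.mean(), ref - ref.mean()                  # zero-mean
    proj = (torch.dot(est, ref) / (ref.pow(2).sum() + eps)) * ref  # scale-invariant target
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum() / (noise.pow(2).sum() + eps))


def si_snri(est: torch.Tensor, ref: torch.Tensor, mix: torch.Tensor) -> torch.Tensor:
    # Improvement over leaving the noisy mixture unprocessed
    return si_snr(est, ref) - si_snr(mix, ref)
```

SI-SNRi measures how much the separated estimate improves over the raw mixture, so higher is better.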

🚀 Try It Out

Thanks to the HF team for featuring our work! Feel free to ask questions - we're here to discuss. 🙌

Paper: arXiv:2509.23610

