Cross-Attention is Half Explanation in Speech-to-Text Models
Abstract
Cross-attention in speech-to-text models aligns moderately to strongly with saliency-based explanations, but it captures only about half of the input relevance and only partially reflects how the decoder uses the encoder's representations.
Cross-attention is a core mechanism in encoder-decoder architectures, widespread in many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications, such as timestamp estimation and audio-text alignment, under the assumption that they reflect the dependencies between the input speech representations and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored in the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, even in the best case, only partially reflects how the decoder attends to the encoder's representations, accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
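To make the comparison concrete, below is a minimal sketch of what such an analysis could look like for a single decoding step of a Whisper model: cross-attention weights are averaged over layers and heads, a gradient-based saliency map is computed over the input features, and the two are compared with a rank correlation. The checkpoint (`openai/whisper-tiny`), the plain-gradient saliency method, and the use of Spearman correlation are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
from scipy.stats import spearmanr
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.eval()

# One second of dummy audio at 16 kHz; substitute a real utterance.
audio = torch.randn(16000)
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
features = inputs.input_features.requires_grad_(True)  # (1, 80 mels, 3000 frames)

decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
out = model(input_features=features, decoder_input_ids=decoder_ids,
            output_attentions=True)

# Cross-attention aggregated over layers and heads for the first target token:
# one weight per encoder frame (Whisper's conv front-end halves the time axis).
attn = torch.stack(out.cross_attentions).mean(dim=(0, 2))[0, 0]  # (1500,)

# Gradient saliency: d(top logit)/d(input features), summed over mel bins and
# pooled in pairs of frames to match the encoder's frame rate.
top_logit = out.logits[0, -1].max()
grad = torch.autograd.grad(top_logit, features)[0][0].abs().sum(dim=0)  # (3000,)
saliency = grad.reshape(-1, 2).sum(dim=1)  # (1500,)

rho, _ = spearmanr(attn.detach().numpy(), saliency.numpy())
print(f"Spearman rho between cross-attention and saliency: {rho:.3f}")
```

On real audio, repeating this per generated token and aggregating over a corpus yields the kind of alignment statistic the abstract reports; the paper itself uses stronger feature-attribution methods than the plain gradient shown here.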
Community
This paper investigates whether cross-attention in speech-to-text models can serve as a faithful explanation of model behavior. Through large-scale analysis against state-of-the-art saliency maps, we show that cross-attention captures only part of the explanation, cautioning against its sole use for interpretability and motivating the integration of complementary explanation methods.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Beyond Transcription: Mechanistic Interpretability in ASR (2025)
- DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models (2025)
- MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis (2025)
- Whisper-UT: A Unified Translation Framework for Speech and Text (2025)
- Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages (2025)
- Self-Improvement for Audio Large Language Model using Unlabeled Speech (2025)
- Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System (2025)