arxiv:2510.02295

VideoNSA: Native Sparse Attention Scales Video Understanding

Published on Oct 2 · Submitted by EnxinSong on Oct 3

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid attention scheme, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combination of sparse attention branches helps induce dynamic attention sinks.

AI-generated summary

VideoNSA, an adaptation of Native Sparse Attention to video-language models, enhances long-video understanding and temporal reasoning through end-to-end training and a hardware-aware hybrid attention approach.
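
The hybrid design routes attention by modality: text queries keep dense attention over the full mixed context, while video queries go through a budgeted sparse attention branch. The snippet below is a minimal sketch of that routing idea, not the paper's released implementation; `hybrid_attention`, `sparse_attn_fn`, and the shapes are assumptions, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F


def hybrid_attention(q, k, v, is_video_token, sparse_attn_fn):
    """Dense attention for text queries, sparse attention for video queries.

    q, k, v:        (batch, heads, seq_len, head_dim)
    is_video_token: (seq_len,) bool mask, True where the token comes from video
    sparse_attn_fn: stand-in for an NSA-style budgeted sparse attention kernel
    Causal masking is omitted to keep the sketch short.
    """
    out = torch.empty_like(q)

    # Text queries: full dense attention over the whole mixed sequence.
    text_idx = (~is_video_token).nonzero(as_tuple=True)[0]
    if text_idx.numel() > 0:
        out[:, :, text_idx] = F.scaled_dot_product_attention(q[:, :, text_idx], k, v)

    # Video queries: route through the sparse branch instead.
    video_idx = is_video_token.nonzero(as_tuple=True)[0]
    if video_idx.numel() > 0:
        out[:, :, video_idx] = sparse_attn_fn(q[:, :, video_idx], k, v)

    return out


# Shape check with a trivial dense stand-in for the sparse kernel.
B, H, L, D = 1, 4, 16, 32
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
is_video = torch.zeros(L, dtype=torch.bool)
is_video[4:] = True  # pretend the last 12 tokens are video tokens
out = hybrid_attention(q, k, v, is_video, F.scaled_dot_product_attention)
```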

Community

Paper submitter

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. We introduce VideoNSA, a hardware-aware hybrid sparse attention mechanism. We further conduct extensive experiments and analyses of VideoNSA, revealing several important findings.

Github: https://github.com/Espere-1119-Song/VideoNSA
Project Page: https://enxinsong.com/VideoNSA-web/
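
Several of the findings above concern how a fixed attention budget is split between global branches (compressed and selected keys) and a local sliding-window branch, and how branch usage varies by task. NSA-style designs combine branch outputs with learned, query-conditioned gates; the sketch below is my own minimal illustration of that gating idea under assumed shapes, with a hypothetical class name, and is not the released VideoNSA code.

```python
import torch
import torch.nn as nn


class GatedBranchMix(nn.Module):
    """Mix per-branch attention outputs with query-conditioned sigmoid gates."""

    def __init__(self, hidden_dim: int, num_branches: int = 3):
        super().__init__()
        # One gate per branch (e.g. compression, selection, sliding window),
        # predicted from each query position's hidden state.
        self.gate_proj = nn.Linear(hidden_dim, num_branches)

    def forward(self, query_hidden, branch_outputs):
        # query_hidden:   (batch, seq, hidden)
        # branch_outputs: list of (batch, seq, hidden) tensors, one per branch
        gates = torch.sigmoid(self.gate_proj(query_hidden))  # (B, L, num_branches)
        stacked = torch.stack(branch_outputs, dim=-1)         # (B, L, hidden, num_branches)
        return (stacked * gates.unsqueeze(2)).sum(dim=-1)     # (B, L, hidden)


# Shape check with dummy branch outputs.
B, L, D = 1, 8, 64
mix = GatedBranchMix(hidden_dim=D)
h = torch.randn(B, L, D)
branches = [torch.randn(B, L, D) for _ in range(3)]
fused = mix(h, branches)  # (B, L, D)
```

Because the gates are predicted per query token, such a mixture can shift weight between global and local branches depending on the input, which is the kind of behavior the paper's branch-usage and attention-sink analyses examine.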

