VideoNSA: Native Sparse Attention Scales Video Understanding
Abstract
VideoNSA, an adaptation of Native Sparse Attention to video-language models, enhances long-video understanding and temporal reasoning through end-to-end training and a hardware-aware hybrid attention approach.
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid attention scheme, preserving dense attention for text while applying NSA to video tokens. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation under a fixed budget; (3) task-dependent branch-usage patterns; and (4) the learnable combination of sparse-attention branches helps induce dynamic attention sinks.
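The hybrid design described above (dense attention over text tokens, an NSA-style gated mix of sparse branches over video tokens) can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical illustration and not the paper's implementation: the function names (`nsa_style_attention`, `hybrid_attention`), the branch simplifications (mean-pooled compression blocks, mask-based block selection, a symmetric sliding window), and all hyperparameters are assumptions made for clarity.

```python
# Minimal sketch (assumed, not the paper's code) of the hybrid idea:
# dense attention for text tokens, an NSA-style gated mix of three sparse
# branches (compression, block selection, sliding window) for video tokens.
import torch
import torch.nn.functional as F


def dense_attention(q, k, v):
    """Standard scaled-dot-product attention, used for text tokens."""
    return F.scaled_dot_product_attention(q, k, v)


def nsa_style_attention(q, k, v, gate_logits, block_size=16, top_blocks=4, window=64):
    """Very simplified stand-in for NSA's three branches, mixed by learned gates."""
    B, H, T, D = q.shape

    # Compression branch: mean-pool keys/values into coarse blocks and attend to them.
    pad = (-T) % block_size
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))
    k_cmp = k_pad.view(B, H, -1, block_size, D).mean(dim=3)
    v_cmp = v_pad.view(B, H, -1, block_size, D).mean(dim=3)
    out_cmp = F.scaled_dot_product_attention(q, k_cmp, v_cmp)

    # Selection branch: each query attends only to its top-scoring blocks
    # (expressed as a dense boolean mask here purely for readability).
    n_blocks = k_cmp.size(2)
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, k_cmp) / D ** 0.5
    top = block_scores.topk(min(top_blocks, n_blocks), dim=-1).indices
    sel = torch.zeros(B, H, T, n_blocks, dtype=torch.bool, device=q.device)
    sel.scatter_(-1, top, True)
    token_mask = sel.repeat_interleave(block_size, dim=-1)[..., :T]
    out_sel = F.scaled_dot_product_attention(q, k, v, attn_mask=token_mask)

    # Sliding-window branch: local neighborhood attention (symmetric for brevity).
    idx = torch.arange(T, device=q.device)
    local = (idx[None, :] - idx[:, None]).abs() < window
    out_win = F.scaled_dot_product_attention(q, k, v, attn_mask=local)

    # Learnable gates (assumed to come from a small head on the query) mix the branches.
    g = torch.sigmoid(gate_logits)  # (B, H, T, 3)
    return g[..., 0:1] * out_cmp + g[..., 1:2] * out_sel + g[..., 2:3] * out_win


def hybrid_attention(q, k, v, is_video, gate_logits):
    """Dense path for text-token queries, sparse path for video-token queries.
    A real kernel would dispatch per segment; both paths are computed here for clarity."""
    dense_out = dense_attention(q, k, v)
    sparse_out = nsa_style_attention(q, k, v, gate_logits)
    route = is_video.view(1, 1, -1, 1)  # (1, 1, T, 1), broadcast over batch and heads
    return torch.where(route, sparse_out, dense_out)


if __name__ == "__main__":
    B, H, T, D = 1, 8, 1024, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    is_video = torch.arange(T) < 896        # e.g. video tokens first, text after
    gate_logits = torch.randn(B, H, T, 3)   # would be produced by a learned gate head
    print(hybrid_attention(q, k, v, is_video, gate_logits).shape)  # (1, 8, 1024, 64)
```

The per-token routing also hints at why finding (4) is plausible under these assumptions: because the gate values are learned per query, the model can dynamically amplify or suppress individual branches, which is one way attention-sink-like behavior could emerge.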
Community
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. We introduce VideoNSA, a hardware-aware hybrid sparse attention mechanism, and conduct extensive experiments and analyses that reveal several important findings.
Github: https://github.com/Espere-1119-Song/VideoNSA
Project Page: https://enxinsong.com/VideoNSA-web/
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding (2025)
- Dense Video Understanding with Gated Residual Tokenization (2025)
- From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding (2025)
- An Empirical Study on How Video-LLMs Answer Video Questions (2025)
- Bidirectional Sparse Attention for Faster Video Diffusion Training (2025)
- Video Panels for Long Video Understanding (2025)
- VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning (2025)
Models citing this paper 1
Datasets citing this paper 0
Spaces citing this paper 0
Collections including this paper 0