VideoNSA: Native Sparse Attention Scales Video Understanding
Abstract
VideoNSA, an adaptation of Native Sparse Attention to video-language models, enhances long-video understanding and temporal reasoning through end-to-end training and a hardware-aware hybrid attention approach.
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid attention scheme, preserving dense attention for text while applying NSA to video tokens. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation under a fixed budget; (3) task-dependent branch-usage patterns; and (4) the learnable combination of sparse-attention branches helps induce dynamic attention sinks.
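The hybrid design described above (dense attention over text tokens, an NSA-style gated mix of sparse branches over video tokens) can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical illustration and not the paper's implementation: the function names (`nsa_style_attention`, `hybrid_attention`), the branch simplifications (mean-pooled compression blocks, mask-based block selection, a symmetric sliding window), and all hyperparameters are assumptions made for clarity.

```python
# Minimal sketch (assumed, not the paper's code) of the hybrid idea:
# dense attention for text tokens, an NSA-style gated mix of three sparse
# branches (compression, block selection, sliding window) for video tokens.
import torch
import torch.nn.functional as F


def dense_attention(q, k, v):
    """Standard scaled-dot-product attention, used for text tokens."""
    return F.scaled_dot_product_attention(q, k, v)


def nsa_style_attention(q, k, v, gate_logits, block_size=16, top_blocks=4, window=64):
    """Very simplified stand-in for NSA's three branches, mixed by learned gates."""
    B, H, T, D = q.shape

    # Compression branch: mean-pool keys/values into coarse blocks and attend to them.
    pad = (-T) % block_size
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))
    k_cmp = k_pad.view(B, H, -1, block_size, D).mean(dim=3)
    v_cmp = v_pad.view(B, H, -1, block_size, D).mean(dim=3)
    out_cmp = F.scaled_dot_product_attention(q, k_cmp, v_cmp)

    # Selection branch: each query attends only to its top-scoring blocks
    # (expressed as a dense boolean mask here purely for readability).
    n_blocks = k_cmp.size(2)
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, k_cmp) / D ** 0.5
    top = block_scores.topk(min(top_blocks, n_blocks), dim=-1).indices
    sel = torch.zeros(B, H, T, n_blocks, dtype=torch.bool, device=q.device)
    sel.scatter_(-1, top, True)
    token_mask = sel.repeat_interleave(block_size, dim=-1)[..., :T]
    out_sel = F.scaled_dot_product_attention(q, k, v, attn_mask=token_mask)

    # Sliding-window branch: local neighborhood attention (symmetric for brevity).
    idx = torch.arange(T, device=q.device)
    local = (idx[None, :] - idx[:, None]).abs() < window
    out_win = F.scaled_dot_product_attention(q, k, v, attn_mask=local)

    # Learnable gates (assumed to come from a small head on the query) mix the branches.
    g = torch.sigmoid(gate_logits)  # (B, H, T, 3)
    return g[..., 0:1] * out_cmp + g[..., 1:2] * out_sel + g[..., 2:3] * out_win


def hybrid_attention(q, k, v, is_video, gate_logits):
    """Dense path for text-token queries, sparse path for video-token queries.
    A real kernel would dispatch per segment; both paths are computed here for clarity."""
    dense_out = dense_attention(q, k, v)
    sparse_out = nsa_style_attention(q, k, v, gate_logits)
    route = is_video.view(1, 1, -1, 1)  # (1, 1, T, 1), broadcast over batch and heads
    return torch.where(route, sparse_out, dense_out)


if __name__ == "__main__":
    B, H, T, D = 1, 8, 1024, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    is_video = torch.arange(T) < 896        # e.g. video tokens first, text after
    gate_logits = torch.randn(B, H, T, 3)   # would be produced by a learned gate head
    print(hybrid_attention(q, k, v, is_video, gate_logits).shape)  # (1, 8, 1024, 64)
```

The per-token routing also hints at why finding (4) is plausible under these assumptions: because the gate values are learned per query, the model can dynamically amplify or suppress individual branches, which is one way attention-sink-like behavior could emerge.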
Community
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. We introduce VideoNSA, a hardware-aware hybrid sparse attention mechanism, and conduct extensive experiments and analyses that reveal several important findings.
Github: https://github.com/Espere-1119-Song/VideoNSA
Project Page: https://enxinsong.com/VideoNSA-web/
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding (2025)
- Dense Video Understanding with Gated Residual Tokenization (2025)
- From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding (2025)
- An Empirical Study on How Video-LLMs Answer Video Questions (2025)
- Bidirectional Sparse Attention for Faster Video Diffusion Training (2025)
- Video Panels for Long Video Understanding (2025)
- VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning (2025)
Models citing this paper 1
Datasets citing this paper 0
Spaces citing this paper 0
Collections including this paper 0