SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Abstract
SANA-Video, a small diffusion model, efficiently generates high-resolution, high-quality videos with strong text-video alignment using linear attention and a constant-memory KV cache, achieving competitive performance at a lower cost and faster speed.
We introduce SANA-Video, a small diffusion model that can efficiently generate videos at up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV Cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative property of linear attention. This state provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReels-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating inference for a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
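To make the second design concrete, below is a minimal sketch (not the authors' implementation) of how linear attention's cumulative property yields a constant-memory state that can replace a traditional KV cache. The `feature_map` choice (ReLU), the function names, and the block-loop structure are illustrative assumptions; the actual SANA-Video kernels will differ.

```python
# Minimal sketch of block-wise linear attention with a constant-memory
# state. Hypothetical names; not the SANA-Video implementation.
import torch

def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Non-negative kernel feature map; ReLU is one common choice.
    return torch.relu(x)

@torch.no_grad()
def block_linear_attention(q_blocks, k_blocks, v_blocks):
    """Causal block-wise linear attention over lists of (n_i, d) blocks.

    Rather than caching every past key/value (memory grows with video
    length), we keep a running (d, d) state S = sum phi(k)^T v and a
    (d,) normalizer z = sum phi(k): constant memory for any duration.
    """
    d = k_blocks[0].shape[-1]
    S = torch.zeros(d, d)  # cumulative phi(k)^T v over all past blocks
    z = torch.zeros(d)     # cumulative phi(k), used for normalization
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current block into the running state first, so tokens
        # attend bidirectionally within their block and causally to all
        # earlier blocks (block-wise autoregression).
        S = S + phi_k.T @ v
        z = z + phi_k.sum(dim=0)
        out = phi_q @ S                       # (n, d): global context
        denom = (phi_q @ z).clamp_min(1e-6)   # (n,): attention normalizer
        outputs.append(out / denom.unsqueeze(-1))
    return torch.cat(outputs, dim=0)

# Usage: three blocks of 128 tokens each, head dimension 64.
blocks = [torch.randn(128, 64) for _ in range(3)]
y = block_linear_attention(blocks, blocks, blocks)  # shape (384, 64)
```

Because the state (S, z) is a fixed (d, d) matrix plus a (d,) vector per head, memory stays constant no matter how many blocks are generated, which is what allows minute-long videos without a growing KV cache.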
Community
🚀 SANA-Video: Efficient AI Video Generation is Here!
It's time for a new SANA family member! We introduce SANA-Video, a small diffusion model engineered to push the limits of video generation speed, making high-quality short- and long-form video generation truly deployable on consumer hardware.
Core Efficiency & Impact
Architecture: Pure Linear Diffusion Transformer + Block Linear KV Cache = Unmatched Efficiency.
Training Cost: We knocked down the price! Training completed in just 12 days on 64 H100 GPUs.
Speed: A pre-trained model generates a 720p video in just 36 seconds.
Unified: A unified framework for Text-to-Video, Image-to-Video, and Text-to-Image—all in one!
Most important: it's STILL a fully open-source model.
📖 Paper: https://huggingface.co/papers/2509.24695
💻 Project Page: https://nvlabs.github.io/Sana/Video
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder (2025)
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer (2025)
- LongLive: Real-time Interactive Long Video Generation (2025)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time (2025)
- SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling (2025)
- Autoregressive Video Generation beyond Next Frames Prediction (2025)
- BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation (2025)