arxiv:2509.24695

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Published on Sep 29 · Submitted by Yuyang on Sep 30
Abstract

AI-generated summary: SANA-Video, a small diffusion model, efficiently generates high-resolution, high-quality videos with strong text-video alignment using linear attention and a constant-memory KV cache, achieving competitive performance at a lower cost and faster speed.

We introduce SANA-Video, a small diffusion model that can efficiently generate videos at up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable our efficient, effective, and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation by employing a constant-memory state derived from the cumulative properties of linear attention. This state provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReels-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating inference for a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
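
As a concrete illustration of why linear attention admits a constant-memory "KV cache": past tokens can be folded into a fixed-size state rather than stored individually. The PyTorch sketch below is an illustrative assumption, not the paper's implementation; the names, shapes, and the ReLU feature map are ours.

```python
# Minimal sketch of the linear-attention recurrence behind a
# constant-memory "KV cache". Shapes and feature map are assumptions.
import torch

def linear_attention_step(q, k, v, S, z, eps=1e-6):
    """One token step of linear attention with a non-negative feature map.

    q, k: (d,)   query/key features (already passed through e.g. ReLU).
    v:    (d_v,) value vector.
    S:    (d, d_v) running sum of k ⊗ v  -- the fixed-size state.
    z:    (d,)     running sum of k      -- the normalizer state.
    """
    S = S + torch.outer(k, v)        # fold this token's key-value pair in
    z = z + k                        # accumulate keys for normalization
    out = (q @ S) / (q @ z + eps)    # attention output for this token
    return out, S, z

d, d_v = 64, 64
S = torch.zeros(d, d_v)              # state size is fixed: memory does
z = torch.zeros(d)                   # not grow with the number of tokens
for _ in range(10_000):              # arbitrarily many tokens, O(1) memory
    q = torch.relu(torch.randn(d))
    k = torch.relu(torch.randn(d))
    v = torch.randn(d_v)
    out, S, z = linear_attention_step(q, k, v, S, z)
```

With vanilla attention, the same loop would need to store all 10,000 keys and values; here the entire history lives in the (d, d_v) state, which is what makes a fixed memory cost possible regardless of video length.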

Community

Paper submitter · edited 2 days ago

🚀 SANA-Video: Efficient AI Video Generation is Here!
It's time for a new SANA family member! We introduce SANA-Video, a small diffusion model engineered to push the limits of video generation speed, making high-quality short- and long-form video truly deployable on consumer hardware.

Core Efficiency & Impact
Architecture: Pure Linear Diffusion Transformer + Block Linear KV Cache = unmatched efficiency (see the sketch after this list).

Training Cost: We knocked down the price! Training completed in just 12 days on 64 H100 GPUs.

Speed: A pre-trained model generates a 720p video within just 36 seconds.

Unified: A unified framework for Text-to-Video, Image-to-Video, and Text-to-Image—all in one!

Most important: STILL a fully open-source model.
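
For intuition on how the Block Linear KV Cache enables long generation, here is a self-contained sketch under stated assumptions: `denoise_block` is a placeholder for the model's per-block denoising, and all names and shapes are hypothetical rather than SANA-Video's actual API. The point is only that the history of all previous blocks is carried in a fixed-size state (S, z):

```python
# Hypothetical sketch of block-wise autoregressive long-video generation
# with a constant-memory state. `denoise_block` is a stand-in, not the
# real SANA-Video API; frames are random placeholders.
import torch

def denoise_block(prompt_emb, S, z, frames, d=64, d_v=64):
    # Stand-in for per-block diffusion denoising conditioned on (S, z);
    # returns the new block and the updated linear-attention state.
    block = torch.randn(frames, d_v)        # placeholder generated latents
    k = torch.relu(torch.randn(frames, d))  # keys for the new block
    S = S + k.T @ block                     # fold block into the state
    z = z + k.sum(dim=0)
    return block, S, z

def generate_long_video(prompt_emb=None, num_blocks=12,
                        frames_per_block=16, d=64, d_v=64):
    S = torch.zeros(d, d_v)        # constant-memory "KV cache"
    z = torch.zeros(d)
    blocks = []
    for _ in range(num_blocks):    # arbitrarily many blocks, O(1) state
        block, S, z = denoise_block(prompt_emb, S, z,
                                    frames_per_block, d, d_v)
        blocks.append(block)
    return torch.cat(blocks, dim=0)

video = generate_long_video()      # (num_blocks * frames_per_block, d_v)
```

Because each block conditions only on the fixed-size summary of everything before it, memory stays flat as the video grows, which is what makes minute-long generation tractable.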

📖 Paper: https://huggingface.co/papers/2509.24695
💻 Project Page: https://nvlabs.github.io/Sana/Video

