SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Abstract
SANA-Video, a small diffusion model, efficiently generates high-resolution, high-quality videos with strong text-video alignment using linear attention and a constant-memory KV cache, achieving competitive performance at a lower cost and faster speed.
We introduce SANA-Video, a small diffusion model that can efficiently generate videos at up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV Cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative property of linear attention. This state provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReels-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating inference for a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
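To make the second design concrete, below is a minimal sketch (not the authors' implementation) of how linear attention's cumulative property yields a constant-memory state that can replace a traditional KV cache. The `feature_map` choice (ReLU), the function names, and the block-loop structure are illustrative assumptions; the actual SANA-Video kernels will differ.

```python
# Minimal sketch of block-wise linear attention with a constant-memory
# state. Hypothetical names; not the SANA-Video implementation.
import torch

def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Non-negative kernel feature map; ReLU is one common choice.
    return torch.relu(x)

@torch.no_grad()
def block_linear_attention(q_blocks, k_blocks, v_blocks):
    """Causal block-wise linear attention over lists of (n_i, d) blocks.

    Rather than caching every past key/value (memory grows with video
    length), we keep a running (d, d) state S = sum phi(k)^T v and a
    (d,) normalizer z = sum phi(k): constant memory for any duration.
    """
    d = k_blocks[0].shape[-1]
    S = torch.zeros(d, d)  # cumulative phi(k)^T v over all past blocks
    z = torch.zeros(d)     # cumulative phi(k), used for normalization
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current block into the running state first, so tokens
        # attend bidirectionally within their block and causally to all
        # earlier blocks (block-wise autoregression).
        S = S + phi_k.T @ v
        z = z + phi_k.sum(dim=0)
        out = phi_q @ S                       # (n, d): global context
        denom = (phi_q @ z).clamp_min(1e-6)   # (n,): attention normalizer
        outputs.append(out / denom.unsqueeze(-1))
    return torch.cat(outputs, dim=0)

# Usage: three blocks of 128 tokens each, head dimension 64.
blocks = [torch.randn(128, 64) for _ in range(3)]
y = block_linear_attention(blocks, blocks, blocks)  # shape (384, 64)
```

Because the state (S, z) is a fixed (d, d) matrix plus a (d,) vector per head, memory stays constant no matter how many blocks are generated, which is what allows minute-long videos without a growing KV cache.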
Community
🚀 SANA-Video: Efficient AI Video Generation is Here!
It's time for a new SANA family member! We introduce SANA-Video, a small diffusion model engineered to push the limits of video generation speed, making high-quality short- and long-form video generation truly deployable on consumer hardware.
Core Efficiency & Impact
Architecture: Pure Linear Diffusion Transformer + Block Linear KV Cache = Unmatched Efficiency.
Training Cost: We knocked down the price! Training completed in just 12 days on 64 H100 GPUs.
Speed: A pre-trained model generates a 720p video in just 36 seconds.
Unified: A unified framework for Text-to-Video, Image-to-Video, and Text-to-Image—all in one!
Most important: it's STILL a fully open-source model.
📖 Paper: https://huggingface.co/papers/2509.24695
💻 Project Page: https://nvlabs.github.io/Sana/Video
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder (2025)
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer (2025)
- LongLive: Real-time Interactive Long Video Generation (2025)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time (2025)
- SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling (2025)
- Autoregressive Video Generation beyond Next Frames Prediction (2025)
- BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation (2025)