UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Abstract
A pre-trained video diffusion transformer, UniVid, is fine-tuned to handle diverse vision tasks without task-specific modifications, demonstrating cross-modal and cross-source generalization.
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works such as the Large Vision Model (LVM) extend the paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as context to guide the output. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle a variety of vision tasks without task-specific modifications. Tasks are represented as visual sentences, in which the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; and (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can be switched simply by reversing the order of the visual sentence in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
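To make the visual-sentence idea concrete, the sketch below shows one plausible way to assemble context (input, output) examples and a query into a single frame sequence that a video diffusion transformer could condition on, and how reversing the pair order flips the task direction between understanding and generation. This is a minimal illustration based only on the abstract; the function names and tensor layout are assumptions, not UniVid's actual API.

```python
# Minimal sketch (not UniVid's actual implementation): building a "visual sentence"
# as one long clip of frames. All names here are illustrative assumptions.
import torch

def build_visual_sentence(context_pairs, query, reverse=False):
    """Concatenate (input, output) context examples and a query along time.

    context_pairs: list of (input_frames, output_frames), each a [T, C, H, W] tensor
    query: [T, C, H, W] tensor whose "answer" frames the model should generate
    reverse: flipping the pair order switches the task direction, e.g. from
             understanding (natural -> annotation) to generation (annotation -> natural).
    """
    clips = []
    for inp, out in context_pairs:
        pair = (out, inp) if reverse else (inp, out)
        clips.extend(pair)
    clips.append(query)
    # One long clip the video model treats as a temporal sequence: [sum(T_i), C, H, W]
    return torch.cat(clips, dim=0)

# Toy usage with random frames standing in for real images or video clips.
ctx = [(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)) for _ in range(2)]
qry = torch.rand(4, 3, 64, 64)
understanding_prompt = build_visual_sentence(ctx, qry)                # e.g. predict annotations
generation_prompt = build_visual_sentence(ctx, qry, reverse=True)     # e.g. predict natural frames
print(understanding_prompt.shape, generation_prompt.shape)
```

In this reading, the context frames fully define the task, so no task-specific heads or labels are needed; the same pre-trained video backbone only has to continue the sequence in the modality the context implies.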