VIRTUE: Visual-Interactive Text-Image Universal Embedder
Abstract
VIRTUE, a novel Visual-InteRactive Text-Image Universal Embedder, integrates segmentation and vision-language models to enable visual interactions and localized grounding, achieving state-of-the-art performance in representation learning tasks.
Multimodal representation learning models have demonstrated strong performance across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications grounded in localized user intent, which remain unexplored, but also enable the models to learn entity-level information within images that complements their global representations for conventional embedding tasks. In this paper, we propose VIRTUE, a novel Visual-InteRactive Text-Image Universal Embedder that extends the capabilities of a segmentation model and a vision-language model to the realm of representation learning. In VIRTUE, the segmentation model processes visual prompts that pinpoint specific regions within an image, enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate VIRTUE's visual-interaction ability, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark of 1M samples that requires retrieving the text caption by jointly considering a specified object and the surrounding image scene. VIRTUE consistently achieves state-of-the-art performance, with significant improvements across 36 universal MMEB tasks (3.1%-8.5%) and five visual-interactive SCaR tasks (15.2%-20.3%).
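As a rough illustration of the idea described in the abstract, the sketch below shows how a prompt-conditioned segmentation encoder could be fused with global VLM features to produce a region-aware embedding for caption retrieval. All module names, dimensions, and the fusion scheme are hypothetical placeholders chosen for illustration; this is not the released VIRTUE implementation.

```python
# Hypothetical sketch of a visual-interactive embedder in the spirit of VIRTUE.
# Every module, dimension, and fusion choice here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentationPromptEncoder(nn.Module):
    """Stand-in for a promptable segmentation backbone (e.g., a SAM-like model).

    Takes an image and a visual prompt (here: a bounding box) and returns
    entity-level features for the prompted region.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.image_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # crude patchifier
        self.box_proj = nn.Linear(4, dim)                               # encodes (x1, y1, x2, y2)

    def forward(self, image: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        patches = self.image_proj(image).flatten(2).transpose(1, 2)     # (B, N, dim)
        box_token = self.box_proj(box).unsqueeze(1)                     # (B, 1, dim)
        # Attend from the box token to image patches to pool entity-level features.
        attn = torch.softmax(
            box_token @ patches.transpose(1, 2) / patches.size(-1) ** 0.5, dim=-1
        )
        return (attn @ patches).squeeze(1)                              # (B, dim)


class VisualInteractiveEmbedder(nn.Module):
    """Fuses global VLM features with prompt-conditioned entity features."""

    def __init__(self, vlm_dim: int = 512, seg_dim: int = 256, out_dim: int = 512):
        super().__init__()
        self.seg_encoder = SegmentationPromptEncoder(seg_dim)
        self.fuse = nn.Linear(vlm_dim + seg_dim, out_dim)

    def forward(self, global_feat: torch.Tensor, image: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        entity_feat = self.seg_encoder(image, box)
        fused = self.fuse(torch.cat([global_feat, entity_feat], dim=-1))
        return F.normalize(fused, dim=-1)                               # unit-norm embedding for retrieval


if __name__ == "__main__":
    model = VisualInteractiveEmbedder()
    image = torch.randn(2, 3, 224, 224)                  # batch of images
    box = torch.tensor([[0.1, 0.1, 0.5, 0.5],            # normalized (x1, y1, x2, y2) prompts
                        [0.3, 0.2, 0.9, 0.8]])
    global_feat = torch.randn(2, 512)                    # placeholder for global VLM features
    text_emb = F.normalize(torch.randn(4, 512), dim=-1)  # placeholder caption embeddings
    query = model(global_feat, image, box)               # (2, 512) region-aware image embeddings
    scores = query @ text_emb.T                          # cosine similarities for caption retrieval
    print(scores.shape)                                  # torch.Size([2, 4])
```

The point of the sketch is the data flow: the visual prompt conditions an entity-level feature that is combined with the global representation before projection into the shared embedding space used for retrieval.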
Community
Enabling visual prompts (bboxes, clicks, masks) for embedding models!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs (2025)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions (2025)
- X-SAM: From Segment Anything to Any Segmentation (2025)
- Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation (2025)
- Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval (2025)
- Text4Seg++: Advancing Image Segmentation via Generative Language Modeling (2025)
- LENS: Learning to Segment Anything with Unified Reinforced Reasoning (2025)