UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Abstract
UniPixel, a large multi-modal model, integrates pixel-level perception with general visual understanding, enabling fine-grained reasoning across various tasks including pixel-level referring, segmentation, and question answering.
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. In contrast, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to achieve pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, then performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
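To make the mask-as-pointer inference flow described above more concrete, below is a minimal, hypothetical Python sketch: the model encodes the visual input, optionally encodes a visual prompt, produces masks on demand, and conditions its final answer on those intermediate masks. All names here (`answer_with_masks`, `encode_video`, `segment_on_demand`, `VisualPrompt`, etc.) are illustrative assumptions for exposition, not the released UniPixel API.

```python
# Conceptual sketch of the mask-as-pointer inference flow described in the
# abstract. Every class, method, and function name below is a hypothetical
# placeholder, not the actual UniPixel implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisualPrompt:
    """A user-supplied pointer to a region, e.g. a box drawn on one frame."""
    frame_index: int
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized

def answer_with_masks(model, video_frames, question: str,
                      prompt: Optional[VisualPrompt] = None):
    # 1. Encode the video holistically, as a standard LMM would.
    visual_tokens = model.encode_video(video_frames)

    # 2. If a visual prompt is given, encode it into region tokens so the
    #    language model can refer to the pointed-at object.
    region_tokens = model.encode_prompt(prompt, visual_tokens) if prompt else None

    # 3. Generate relevant object masks on demand from the question and the
    #    (optional) referred region.
    masks = model.segment_on_demand(visual_tokens, region_tokens, question)

    # 4. Condition the final answer on these intermediate mask "pointers",
    #    enabling object-centric, pixel-level reasoning rather than a purely
    #    holistic description.
    answer = model.generate(question, visual_tokens, region_tokens, masks)
    return answer, masks
```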
Community
Moving from holistic to pixel-level MLLM!
We are excited to introduce UniPixel, an MLLM for unified object referring and segmentation, accepted to NeurIPS 2025.
- The first unified MLLM to support flexible object referring and segmentation in images and videos, and to integrate these capabilities into pixel-level visual reasoning.
- Strong segmentation, regional understanding, and VideoQA performance on 10 public benchmarks.
- A novel PixelQA task that jointly requires object-centric referring, segmentation, and QA in videos, where UniPixel establishes a strong baseline for this setting (see the illustrative sample sketch after the links below).
- Project Page: https://polyu-chenlab.github.io/unipixel
- arXiv: https://arxiv.org/abs/2509.18094
- GitHub: https://github.com/PolyU-ChenLab/UniPixel
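To show what the PixelQA setting involves, here is a minimal sketch of what a single sample could look like: the model receives a video plus a visual prompt on one frame, must segment the referred object across frames, and must answer a question grounded in that object. The field names, paths, question, and answer below are illustrative assumptions, not the actual dataset schema.

```python
# Illustrative PixelQA-style sample; all fields and values are hypothetical.
pixelqa_sample = {
    "video": "examples/clip_0001.mp4",          # hypothetical video path
    "visual_prompt": {                          # user points at an object
        "frame_index": 12,
        "point": [0.46, 0.58],                  # normalized (x, y) click
    },
    "question": "What does this object interact with later in the video?",
    # Expected outputs: per-frame masks for the referred object (segmentation)
    # plus a free-form answer grounded in that object (QA).
    "target": {
        "masks": "annotations/clip_0001/object_03/",  # hypothetical mask dir
        "answer": "A free-form, object-grounded answer goes here.",
    },
}
```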
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- X-SAM: From Segment Anything to Any Segmentation (2025)
- SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding (2025)
- SimToken: A Simple Baseline for Referring Audio-Visual Segmentation (2025)
- Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation (2025)
- Text4Seg++: Advancing Image Segmentation via Generative Language Modeling (2025)
- VoCap: Video Object Captioning and Segmentation from Any Prompt (2025)
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation (2025)