Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Abstract
Robix, a unified vision-language model, integrates robot reasoning, task planning, and natural language interaction, demonstrating superior performance in interactive task execution through chain-of-thought reasoning and a three-stage training strategy.
We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised fine-tuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, generalizing well across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
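To make the hierarchy concrete, below is a minimal Python sketch of the planner-controller loop the abstract describes: a VLM planner that emits chain-of-thought reasoning, an atomic command, and an optional verbal response at each step. The names (`RobixPlanner`, `PlannerOutput`, `controller.execute`, `camera.capture`) are hypothetical stand-ins, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class PlannerOutput:
    thought: str            # chain-of-thought reasoning (kept internal)
    command: str | None     # atomic command for the low-level controller, or None
    response: str | None    # verbal response for the human, or None

class RobixPlanner:
    """High-level cognitive layer: a VLM mapping the current observation
    and dialogue history to one reasoning-action step."""

    def step(self, image, dialogue_history: list[str]) -> PlannerOutput:
        raise NotImplementedError  # one forward pass of the VLM

def run_task(planner: RobixPlanner, controller, camera,
             dialogue_history: list[str]) -> None:
    """Alternate high-level planning with low-level execution until the
    planner stops issuing atomic commands."""
    while True:
        out = planner.step(camera.capture(), dialogue_history)
        if out.response:                 # talk to the human
            dialogue_history.append(f"robot: {out.response}")
        if out.command is None:          # task complete, or waiting on the user
            break
        controller.execute(out.command)  # e.g. "place the cup in the bin"
```

In this reading, the same forward pass produces both the command stream and the dialogue stream, which is what makes the framework end-to-end rather than a planner bolted onto a separate dialogue module.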
Community
Video demo: https://www.youtube.com/embed/-uEDN31Ne_Y
The main features of Robix are summarized as follows:
🌟 Unified model. Robix is a single vision-language model that unifies robot reasoning, task planning, and human-robot interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally in an end-to-end manner.
🌟 Flexible interaction. Within this unified framework, Robix supports proactive dialogue to clarify ambiguity and infer user intent, real-time interruption handling that seamlessly incorporates feedback, and context-aware commonsense reasoning for complex, open-ended tasks (see the interruption-handling sketch after this list).
🌟 Robust performance. We assess Robix in two setups: (i) on a curated interactive-task benchmark covering both in- and out-of-distribution scenarios with diverse instruction types, and (ii) across five real-world scenarios in a hierarchical robot system with both human teleoperation and an automatic VLA model as the low-level controller. These evaluations demonstrate that Robix consistently delivers strong performance across all settings.
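As a companion to the flexible-interaction bullet, here is a hedged sketch of how real-time interruption handling could be folded into the same loop from the earlier example. `user_channel.poll()` is a hypothetical non-blocking read of the latest user utterance; everything else reuses the illustrative names from the previous sketch, so this is one plausible reading of the behavior, not the paper's implementation.

```python
def run_interactive(planner, controller, camera, user_channel,
                    dialogue_history: list[str]) -> None:
    """Like run_task, but checks for user interruptions between atomic
    commands so feedback is folded into the next planning step."""
    while True:
        # Hypothetical non-blocking read; returns None if the user is silent.
        utterance = user_channel.poll()
        if utterance is not None:
            # The interruption enters the context, so the planner re-plans
            # with the new feedback instead of finishing the stale plan.
            dialogue_history.append(f"user: {utterance}")

        out = planner.step(camera.capture(), dialogue_history)
        if out.response:
            dialogue_history.append(f"robot: {out.response}")
        if out.command is None:
            break
        controller.execute(out.command)
```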

Is it a VLA or a VLM?
Also, why is it being compared to GPT-4o and Gemini 2.5 Pro instead of robot-specific models?
Robix is a vision-language model (VLM) designed for unified robotic task planning and natural human interaction. In our experiments, we compare it against recent embodied models such as Cosmos-Reason1 and RoboBrain-2.0. Because our focus is on modeling complex interactive processes, no other open-source models currently serve as suitable baselines. Our results show, however, that large commercial models such as Gemini 2.5 Pro and GPT-4o are stronger at capturing complex multimodal interactions, which makes them the more competitive references.