WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
Abstract
WristWorld is a 4D world model that generates wrist-view videos from anchor views, improving video generation consistency and VLA performance.
Wrist-view observations are crucial for VLA models because they capture the fine-grained hand-object interactions that directly drive manipulation performance. Yet large-scale datasets rarely include such recordings, leaving a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap: they require a wrist-view first frame and therefore cannot generate wrist-view videos from anchor views alone. Meanwhile, recent visual geometry models such as VGGT provide geometric and cross-view priors that make extreme viewpoint shifts tractable. Building on these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.
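The abstract does not spell out the SPC Loss, but one plausible reading is a reprojection-consistency term: 3D points reconstructed from anchor views, projected through the estimated wrist-view camera, should land on their corresponding wrist-view pixels. The sketch below illustrates that idea only; the function names (`project_points`, `spc_loss`), tensor shapes, and the smooth-L1 penalty are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a Spatial Projection Consistency (SPC) style loss.
# Assumption: the loss penalizes reprojection error between anchor-view
# reconstructions and wrist-view pixel correspondences under the estimated
# wrist camera pose. Shapes and names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def project_points(points_world, extrinsics, intrinsics):
    """Project (N, 3) world points into one camera.

    extrinsics: (4, 4) world-to-camera transform.
    intrinsics: (3, 3) pinhole camera matrix.
    Returns (N, 2) pixel coordinates and (N,) depths.
    """
    ones = torch.ones(points_world.shape[0], 1, device=points_world.device)
    pts_h = torch.cat([points_world, ones], dim=1)   # (N, 4) homogeneous
    pts_cam = (extrinsics @ pts_h.T).T[:, :3]        # (N, 3) camera frame
    depth = pts_cam[:, 2]
    pix = (intrinsics @ pts_cam.T).T                 # (N, 3) before division
    uv = pix[:, :2] / depth.clamp(min=1e-6).unsqueeze(1)
    return uv, depth

def spc_loss(points_world, wrist_uv, wrist_extrinsics, intrinsics):
    """Consistency between projected anchor-view geometry and observed
    wrist-view correspondences (N, 2), masking points behind the camera."""
    uv_pred, depth = project_points(points_world, wrist_extrinsics, intrinsics)
    valid = depth > 1e-6
    return F.smooth_l1_loss(uv_pred[valid], wrist_uv[valid])

# Toy usage: gradients flow into the point cloud (and, in practice, into the
# estimated wrist extrinsics), coupling pose and geometry estimation.
pts = torch.tensor([[0.0, 0.0, 2.0], [0.5, 0.5, 3.0]], requires_grad=True)
E = torch.eye(4)            # estimated wrist-view extrinsics
K = torch.eye(3)            # pinhole intrinsics
uv_obs = torch.zeros(2, 2)  # observed wrist-view correspondences
loss = spc_loss(pts, uv_obs, E, K)
loss.backward()
```

Under this reading, the loss ties the estimated wrist-view pose and the reconstructed 4D points together, which matches the abstract's claim that Stage (i) produces geometrically consistent poses and point clouds for Stage (ii) to render from.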
Community
The wrist view is crucial for VLA models, and this paper bridges 42% of the gap through a generative approach. Does this mean that, fundamentally, generative methods are indeed beneficial for VLAs? Furthermore, how can a 4D World Model be directly integrated into a VLA model in the future?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction (2025)
- GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation (2025)
- WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving (2025)
- ShapeGen4D: Towards High Quality 4D Shape Generation from Videos (2025)
- Imagination at Inference: Synthesizing In-Hand Views for Robust Visuomotor Policy Inference (2025)
- EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory (2025)
- PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos (2025)