RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Abstract
RynnVLA-001 is a vision-language-action model that combines two-stage generative pretraining on ego-centric human demonstration videos with an ActionVAE action representation to achieve superior performance on robot manipulation tasks.
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
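The abstract does not specify the ActionVAE architecture, so the following is only a minimal sketch of the general idea: a variational autoencoder that compresses a chunk of low-level actions into one compact latent embedding, which the VLA can predict instead of raw action sequences. The MLP encoder/decoder, `action_dim=7`, `chunk_len=16`, and `latent_dim=32` are all illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    """Sketch of a VAE that compresses a chunk of robot actions into a
    single latent embedding (all dimensions and layers are assumptions)."""

    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden_dim=256):
        super().__init__()
        flat_dim = action_dim * chunk_len
        # Encoder: flattened action chunk -> Gaussian posterior parameters.
        self.encoder = nn.Sequential(
            nn.Linear(flat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: latent embedding -> reconstructed action chunk.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, flat_dim),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def encode(self, actions):  # actions: (B, chunk_len, action_dim)
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):  # z: (B, latent_dim)
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions, beta=1e-3):
        mu, logvar = self.encode(actions)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        # Standard VAE objective: reconstruction + beta-weighted KL term.
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, z, recon_loss + beta * kl
```

Under this reading, the VLA head would regress the latent embedding z rather than a full action sequence, and the (frozen) decoder would map predicted embeddings back to executable action chunks at inference time.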
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- GeoVLA: Empowering 3D Representations in Vision-Language-Action Models (2025)
- Ego-centric Predictive Model Conditioned on Hand Trajectories (2025)
- ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver (2025)
- Physical Autoregressive Model for Robotic Manipulation without Action Pretraining (2025)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (2025)
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)