---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---
<p align="center">
    <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
</p>

<h3 align="center"><a href="https://github.com/alibaba-damo-academy/WorldVLA/tree/main" style="color:#9C276A">
WorldVLA: Towards Autoregressive Action World Model</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. πŸ™πŸ™ </h2>


<h5 align="center">

[![arXiv](https://img.shields.io/badge/Arxiv-2506.21539-AD1C18.svg?logo=arXiv)](https://arxiv.org/pdf/2506.21539)
[![GitHub](https://img.shields.io/badge/GitHub-WorldVLA-9cf?logo=github)](https://github.com/alibaba-damo-academy/WorldVLA)
[![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE) 
</h5>


## 🌟 Introduction
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.

<div style="text-align: center;">
  <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/overview.png" style="max-width: 100%; height: auto; display: block; margin: 0 auto;">
</div>
<br>

### Action Model Results (Text + Image -> Action)
The action model generates robot actions given a text instruction and image observations (a sketch of the action tokenization follows the examples below).

<table>
  <tr>
    <td width="300">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif" width="100%">
    </td>
    <td width="300">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif" width="100%">
    </td>
    <td width="300">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif" width="100%">
    </td>
  </tr>
  <tr>
    <td><center>Input: Open the middle drawer of the cabinet.</center></td>
    <td><center>Input: Pick up the alphabet soup and place it in the basket.</center></td>
    <td><center>Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.</center></td>
  </tr>
</table>
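
Because WorldVLA is autoregressive, actions are emitted as discrete tokens alongside text and image tokens. The sketch below illustrates the general binning recipe used by autoregressive VLA models; the bin count (256) and the [-1, 1] normalization range are assumptions for illustration, not WorldVLA's exact configuration.

```python
import numpy as np

# Illustrative sketch of action tokenization for an autoregressive VLA model.
# The bin count and normalization range are assumptions for illustration;
# see the WorldVLA repository for the actual scheme.
N_BINS = 256
EDGES = np.linspace(-1.0, 1.0, N_BINS + 1)   # 257 bin edges
CENTERS = (EDGES[:-1] + EDGES[1:]) / 2       # 256 bin centers

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action in [-1, 1] to discrete token ids (0..255)."""
    return np.digitize(np.clip(action, -1.0, 1.0), EDGES[1:-1])

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning: token id -> bin-center value."""
    return CENTERS[tokens]

# A 7-DoF end-effector command: xyz delta, rotation delta, gripper open/close.
action = np.array([0.10, -0.25, 0.05, 0.0, 0.0, 0.3, 1.0])
tokens = action_to_tokens(action)
print(tokens)
print(tokens_to_action(tokens))  # round-trips to within one bin width
```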

### World Model Results (Action + Image -> Image)
The world model generates the next frame given the current frame and the action control (a sketch of the token layout follows the examples below).

<table>
  <tr>
    <td width="300" align="center">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_open_the_top_drawer_and_put_the_bowl_inside.gif" width="100%">
    </td>
    <td width="300" align="center">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_push_the_plate_to_the_front_of_the_stove.gif" width="100%">
    </td>
    <td width="300" align="center">
      <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_put_the_bowl_on_the_stove.gif" width="100%">
    </td>
  </tr>
  <tr>
    <td align="center">
      Input: Action sequence of "Open the top drawer and put the bowl inside".
    </td>
    <td align="center">
      Input: Action sequence of "Push the plate to the front of the stove".
    </td>
    <td align="center">
      Input: Action sequence of "Put the bowl on the stove".
    </td>
  </tr>
</table>
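
Conceptually, world-model inference serializes the current frame into discrete visual codes, appends the action tokens, and lets the model generate the token ids of the next frame, which an image decoder maps back to pixels. The sketch below only illustrates this token layout; the marker tokens and the `model.generate` / `vq_decode` calls are hypothetical placeholders, not the repository's API.

```python
# Conceptual sketch of the world-model token layout. Marker tokens and the
# `model.generate` / `vq_decode` calls are hypothetical placeholders.

def build_world_model_prompt(image_tokens: list, action_tokens: list) -> list:
    """Serialize the current frame and the action chunk into one sequence;
    the model then autoregressively emits the next frame's image tokens."""
    BOI, EOI = "<boi>", "<eoi>"  # begin/end-of-image markers (illustrative)
    return [BOI, *image_tokens, EOI, *action_tokens]

prompt = build_world_model_prompt(
    image_tokens=["img_17", "img_902", "img_445"],  # VQ codes of current frame
    action_tokens=["act_33", "act_101"],            # discretized action chunk
)
print(prompt)
# next_frame_tokens = model.generate(prompt)  # hypothetical autoregressive call
# next_frame = vq_decode(next_frame_tokens)   # map VQ codes back to pixels
```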


## Model Zoo

| Model (256 × 256)    |    HF Link        |    Success Rate (%)     |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial       | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object       | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |
<br>

| Model (512 × 512)    |    HF Link        |    Success Rate (%)     |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial       | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object       | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
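
To fetch a single checkpoint without downloading the whole repository, `huggingface_hub`'s `snapshot_download` with an `allow_patterns` filter works well; the pattern below targets the 512 × 512 LIBERO-Spatial model.

```python
from huggingface_hub import snapshot_download

# Download only the 512x512 LIBERO-Spatial checkpoint from the model repo.
local_dir = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_512/libero_spatial/*"],
)
print(local_dir)  # local snapshot root containing model_512/libero_spatial/
```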


## Citation <a name="citation"></a>
If you find the project helpful for your research, please consider citing our paper:
```bibtex
@article{cen2025worldvla,
  title={WorldVLA: Towards Autoregressive Action World Model},
  author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
  journal={arXiv preprint arXiv:2506.21539},
  year={2025}
}
```

## Acknowledgment <a name="acknowledgment"></a>
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.