---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---
<p align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
</p>
<h3 align="center"><a href="https://github.com/alibaba-damo-academy/WorldVLA/tree/main" style="color:#9C276A">
WorldVLA: Towards Autoregressive Action World Model</a></h3>
<h5 align="center"> If our project helps you, please give us a star β on GitHub to support us. ππ </h2>
<h5 align="center">
[arXiv](https://arxiv.org/pdf/2506.21539)
[GitHub](https://github.com/alibaba-damo-academy/WorldVLA)
[Hugging Face](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA)
[License](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE)
</h5>
## Introduction
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/overview.png" style="max-width: 100%; height: auto; display: block; margin: 0 auto;">
</div>
<br>
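The unified design can be pictured as one autoregressive backbone operating over a shared discrete token stream. The sketch below is a conceptual illustration only (names such as `backbone`, `action_mode`, and `world_mode` are hypothetical, not the released API): it shows how the same next-token predictor can be prompted either for actions or for future frames.

```python
# Conceptual sketch of the unified autoregressive interface (hypothetical names).
# Images, text, and actions are assumed to be discretized into one shared vocabulary;
# the paper and released code define the actual tokenizers and token layout.
from typing import Callable, List

Tokens = List[int]
Backbone = Callable[[Tokens, int], Tokens]  # (prefix, n) -> n sampled tokens

def action_mode(backbone: Backbone, text: Tokens, image: Tokens, n_action: int) -> Tokens:
    """VLA / action model: (text + image) tokens -> action tokens."""
    return backbone(text + image, n_action)

def world_mode(backbone: Backbone, image: Tokens, action: Tokens, n_image: int) -> Tokens:
    """World model: (image + action) tokens -> next-frame image tokens."""
    return backbone(image + action, n_image)

if __name__ == "__main__":
    # Dummy backbone that returns zero tokens, just to show the call pattern.
    dummy: Backbone = lambda prefix, n: [0] * n
    print(len(action_mode(dummy, [1, 2], [3, 4, 5], n_action=7)))  # 7 action tokens
    print(len(world_mode(dummy, [3, 4, 5], [6], n_image=16)))      # 16 image tokens
```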
### Action Model Results (Text + Image -> Action)
The action model generates actions given a text instruction and image observations.
<table>
<tr>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif" width="100%">
</td>
</tr>
<tr>
<td><center>Input: Open the middle drawer of the cabinet.</center></td>
<td><center>Input: Pick up the alphabet soup and place it in the basket.</center></td>
<td><center>Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.</center></td>
</tr>
</table>
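A minimal closed-loop evaluation sketch for this mode, assuming a `policy` callable that wraps a released checkpoint and a Gym-style environment with `reset`/`step`; the names here are illustrative placeholders, not the project's actual evaluation scripts (see the GitHub repo for those).

```python
# Hypothetical closed-loop rollout: the policy maps (instruction, image) -> action.
# `policy` and `env` are placeholders; the repo's evaluation scripts define the real ones.
def run_episode(policy, env, instruction: str, max_steps: int = 300) -> bool:
    """Roll the action model out in the environment; return True on task success."""
    obs = env.reset()
    for _ in range(max_steps):
        image = obs["image"]                        # current camera frame
        action = policy(instruction, image)         # Text + Image -> Action
        obs, reward, done, info = env.step(action)  # apply the predicted action
        if done:
            return bool(info.get("success", reward > 0))
    return False
```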
### World Model Results (Action + Image -> Image)
The world model generates the next frame given the current frame and the action control.
<table>
<tr>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_open_the_top_drawer_and_put_the_bowl_inside.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_push_the_plate_to_the_front_of_the_stove.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_put_the_bowl_on_the_stove.gif" width="100%">
</td>
</tr>
<tr>
<td align="center">
Input: Action sequence of "Open the top drawer and put the bowl inside".
</td>
<td align="center">
Input: Action sequence of "Push the plate to the front of the stove".
</td>
<td align="center">
Input: Action sequence of "Put the bowl on the stove".
</td>
</tr>
</table>
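The world-model mode can be rolled out autoregressively: feed the current frame and the next action, take the predicted frame as the new current frame, and repeat over the action sequence. Below is a small sketch under the assumption of a `world_model(frame, action) -> next_frame` callable (a placeholder, not the released interface).

```python
# Hypothetical rollout loop: Action + Image -> Image, repeated over an action sequence.
def rollout(world_model, first_frame, actions):
    """Predict a frame sequence by repeatedly feeding back the generated frame."""
    frames = [first_frame]
    for action in actions:
        next_frame = world_model(frames[-1], action)  # predict the next observation
        frames.append(next_frame)
    return frames
```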
## Model Zoo
| Model (256 × 256) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |
<br>
| Model (512 × 512) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
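To fetch a single checkpoint folder from the tables above, one option is `huggingface_hub`'s `snapshot_download` with an `allow_patterns` filter; the subfolder path mirrors the HF links (adjust it to the model you need).

```python
# Download only the LIBERO-Spatial 256x256 checkpoint folder from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_256/libero_spatial/*"],  # e.g. model_512/libero_object/* for other rows
)
print("Checkpoint downloaded to:", local_dir)
```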
## Citation <a name="citation"></a>
If you find the project helpful for your research, please consider citing our paper:
```bibtex
@article{cen2025worldvla,
title={WorldVLA: Towards Autoregressive Action World Model},
author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
journal={arXiv preprint arXiv:2506.21539},
year={2025}
}
```
## Acknowledgment <a name="acknowledgment"></a>
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.