RVQ-AT: Residual VQ Action Tokenizer for RDT 2
RVQ-AT is a fast, compact Residual Vector-Quantization (RVQ) tokenizer for robot action streams. It converts continuous control trajectories into short sequences of discrete action tokens that plug directly into autoregressive VLA models.
Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes residuals level-by-level. This yields:
- Higher fidelity at the same bitrate (lower recon MSE / higher SNR)
- Shorter token sequences for the same time horizon
- Stable training via commitment loss, EMA codebook updates, and dead-code revival
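The level-by-level residual scheme can be sketched in a few lines (a minimal illustration under hypothetical codebook shapes, not the repository's implementation):

```python
import torch

def rvq_encode(x, codebooks):
    """Quantize x level-by-level: each codebook quantizes the
    residual left over by the previous levels."""
    residual = x
    indices = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        d = torch.cdist(residual, cb)           # (N, codebook_size) distances
        idx = d.argmin(dim=-1)                  # nearest codeword per row
        residual = residual - cb[idx]           # pass residual to next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)         # (N, num_levels)

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords across levels."""
    return sum(cb[indices[:, i]] for i, cb in enumerate(codebooks))
```

Because each level only has to cover the residual of the previous ones, small codebooks stacked this way reach a fidelity that a single codebook of the same total bitrate cannot.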
Here, we provide:
- RVQ-AT — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
- Simple APIs to fit your own tokenizer on custom action datasets.
Using the Universal RVQ-AT Tokenizer
We recommend chunking actions into ~0.8 s windows at fps = 30 and normalizing each action dimension to [-1, 1] with the provided normalizer before tokenization. Batched encode/decode is supported.
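The per-dimension [-1, 1] normalization is an affine map fit from dataset min/max statistics; a minimal sketch (the repository's LinearNormalizer may differ in detail):

```python
import torch

def fit_linear_scale(data):
    # data: (N, action_dim); per-dimension min/max over the dataset
    lo = data.min(dim=0).values
    hi = data.max(dim=0).values
    scale = 2.0 / (hi - lo)
    offset = -1.0 - lo * scale
    return scale, offset

def normalize(x, scale, offset):
    return x * scale + offset        # maps [lo, hi] -> [-1, 1]

def unnormalize(x, scale, offset):
    return (x - offset) / scale      # inverse map back to raw units
```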
# Run from the repository root: https://github.com/thu-ml/RDT2
import torch
import numpy as np
from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE
# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
"<Path_to_normalizer>" # Download from:
# http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)
# Load your RELATIVE action chunk
action_chunk = torch.zeros((1, 24, 20))
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8s in fps=30 → 24 frames
# - action_dim = 20: following UMI setting (both arms, right to left)
# - [0-2]: RIGHT ARM end effector position in (x, y, z), unit: m
# - [3-8]: RIGHT ARM end effector rotation (6D representation)
# - [9]: RIGHT ARM gripper width, normalized to [0, 0.088], 0.088 means fully open
# - [10-12]: LEFT ARM end effector position in (x, y, z), unit: m
# - [13-18]: LEFT ARM end effector rotation (6D representation)
# - [19]: LEFT ARM gripper width, normalized to [0, 0.088], 0.088 means fully open
# Normalize action
nsample = normalizer["action"].normalize(action_chunk).cuda()
# Encode → tokens
# tokens: torch.LongTensor with shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)  # encode the normalized chunk
# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
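Before feeding tokens into a downstream VLA, it can help to sanity-check the encoder output against the shape and range documented above (a small hypothetical helper, not part of the repository API):

```python
import torch

def validate_tokens(tokens, num_tokens=27, codebook_size=1024):
    """Check encoder output: integer tensor of shape (B, num_tokens)
    with values in [0, codebook_size), per the README's documented layout."""
    assert tokens.dtype == torch.long, "expected integer token ids"
    assert tokens.shape[-1] == num_tokens, f"expected {num_tokens} tokens"
    assert tokens.min() >= 0 and tokens.max() < codebook_size
    return True
```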
[IMPORTANT] Recommended Preprocessing
Although our Residual VQ generalizes well across both hand-held gripper data and real-robot data, if you plan to fine-tune on your own dataset, first verify that your data statistics fall within the bounds of our RVQ, then evaluate the reconstruction error on your data before using it for your own purposes, especially fine-tuning.
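Concretely, the two checks above can be written as small helpers (a sketch; the bound tensors and acceptable thresholds are yours to choose):

```python
import torch

def fraction_out_of_bounds(actions, lo, hi):
    """actions: (N, T, D); lo, hi: (D,) per-dimension bounds the tokenizer
    was fit on. Returns the fraction of values outside those bounds."""
    outside = (actions < lo) | (actions > hi)
    return outside.float().mean().item()

def per_dim_reconstruction_mse(original, reconstructed):
    # Mean squared error per action dimension, averaged over batch and time
    return (original - reconstructed).pow(2).mean(dim=(0, 1))
```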
Safety & Intended Use
RVQ-AT is a representation learning component. Do not deploy decoded actions directly to hardware without:
- Proper sim-to-real validation,
- Safety bounds/clamping and rate limiters,
- Task-level monitors and e-stop fallbacks.
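A minimal example of bounds clamping plus a per-step rate limiter (illustrative only; choose limits appropriate for your hardware):

```python
import torch

def safe_step(action, prev_action, lo, hi, max_delta):
    """Clamp a decoded action to hard safety bounds, then limit how far
    it may move from the previously commanded action in one step."""
    bounded = torch.clamp(action, lo, hi)
    delta = torch.clamp(bounded - prev_action, -max_delta, max_delta)
    return prev_action + delta
```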
Citation
If you use RVQ-AT in your work, please cite:
@software{rdt2,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
Contact
- Issues & requests: open a GitHub issue or start a discussion on the model page on the Hub.
License
This repository and the released models are licensed under Apache-2.0. You may use, modify, and distribute them, provided you keep a copy of the original license and notices in your distributions and state any significant changes you make.