RVQ-AT: Residual VQ Action Tokenizer for RDT 2
RVQ-AT is a fast, compact Residual Vector-Quantization (RVQ) tokenizer for robot action streams. It converts continuous control trajectories into short sequences of discrete action tokens that plug directly into autoregressive VLA models.
Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes residuals level-by-level. This yields:
- Higher fidelity at the same bitrate (lower recon MSE / higher SNR)
- Shorter token sequences for the same time horizon
- Stable training via commitment loss, EMA codebook updates, and dead-code revival
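The level-by-level residual scheme can be sketched in a few lines (a minimal illustration under hypothetical codebook shapes, not the repository's implementation):

```python
import torch

def rvq_encode(x, codebooks):
    """Quantize x level-by-level: each codebook quantizes the
    residual left over by the previous levels."""
    residual = x
    indices = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        d = torch.cdist(residual, cb)           # (N, codebook_size) distances
        idx = d.argmin(dim=-1)                  # nearest codeword per row
        residual = residual - cb[idx]           # pass residual to next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)         # (N, num_levels)

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords across levels."""
    return sum(cb[indices[:, i]] for i, cb in enumerate(codebooks))
```

Because each level only has to cover the residual of the previous ones, small codebooks stacked this way reach a fidelity that a single codebook of the same total bitrate cannot.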
Here, we provide:
- RVQ-AT — a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
- Simple APIs to fit your own tokenizer on custom action datasets.
Using the Universal RVQ-AT Tokenizer
We recommend chunking actions into ~0.8 s windows at fps = 30 and normalizing each action dimension to [-1, 1] with the provided normalizer before tokenization. Batched encode/decode is supported.
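The per-dimension [-1, 1] normalization is an affine map fit from dataset min/max statistics; a minimal sketch (the repository's LinearNormalizer may differ in detail):

```python
import torch

def fit_linear_scale(data):
    # data: (N, action_dim); per-dimension min/max over the dataset
    lo = data.min(dim=0).values
    hi = data.max(dim=0).values
    scale = 2.0 / (hi - lo)
    offset = -1.0 - lo * scale
    return scale, offset

def normalize(x, scale, offset):
    return x * scale + offset        # maps [lo, hi] -> [-1, 1]

def unnormalize(x, scale, offset):
    return (x - offset) / scale      # inverse map back to raw units
```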
# Run from the repository root: https://github.com/thu-ml/RDT2
import torch
import numpy as np
from models.normalizer import LinearNormalizer
from vqvae.models.multivqvae import MultiVQVAE
# Load from the Hub (replace with your repo id once published)
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval()
normalizer = LinearNormalizer.load(
"<Path_to_normalizer>" # Download from:
# http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
)
# Load your RELATIVE action chunk
action_chunk = torch.zeros((1, 24, 20))
# action_chunk shape: (B, T, action_dim)
# - T = 24: predicts the future 0.8s in fps=30 → 24 frames
# - action_dim = 20: following UMI setting (both arms, right to left)
# - [0-2]: RIGHT ARM end effector position in (x, y, z), unit: m
# - [3-8]: RIGHT ARM end effector rotation (6D representation)
# - [9]: RIGHT ARM gripper width, normalized to [0, 0.088], 0.088 means fully open
# - [10-12]: LEFT ARM end effector position in (x, y, z), unit: m
# - [13-18]: LEFT ARM end effector rotation (6D representation)
# - [19]: LEFT ARM gripper width, normalized to [0, 0.088], 0.088 means fully open
# Normalize action
nsample = normalizer["action"].normalize(action_chunk).cuda()
# Encode → tokens
# tokens: torch.LongTensor with shape (B, num_valid_action_token)
# num_valid_action_token = 27, values in range [0, 1024)
tokens = vae.encode(nsample)  # encode the normalized chunk
# Decode back to continuous actions
recon_nsample = vae.decode(tokens)
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample)
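Before feeding tokens into a downstream VLA, it can help to sanity-check the encoder output against the shape and range documented above (a small hypothetical helper, not part of the repository API):

```python
import torch

def validate_tokens(tokens, num_tokens=27, codebook_size=1024):
    """Check encoder output: integer tensor of shape (B, num_tokens)
    with values in [0, codebook_size), per the README's documented layout."""
    assert tokens.dtype == torch.long, "expected integer token ids"
    assert tokens.shape[-1] == num_tokens, f"expected {num_tokens} tokens"
    assert tokens.min() >= 0 and tokens.max() < codebook_size
    return True
```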
[IMPORTANT] Recommended Preprocessing
Although our Residual VQ generalizes well across both hand-held gripper data and real-robot data, if you plan to fine-tune on your own dataset, first verify that your data statistics fall within the bounds of our RVQ, then evaluate the reconstruction error on your data before using it for your own purposes, especially fine-tuning.
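Concretely, the two checks above can be written as small helpers (a sketch; the bound tensors and acceptable thresholds are yours to choose):

```python
import torch

def fraction_out_of_bounds(actions, lo, hi):
    """actions: (N, T, D); lo, hi: (D,) per-dimension bounds the tokenizer
    was fit on. Returns the fraction of values outside those bounds."""
    outside = (actions < lo) | (actions > hi)
    return outside.float().mean().item()

def per_dim_reconstruction_mse(original, reconstructed):
    # Mean squared error per action dimension, averaged over batch and time
    return (original - reconstructed).pow(2).mean(dim=(0, 1))
```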
Safety & Intended Use
RVQ-AT is a representation learning component. Do not deploy decoded actions directly to hardware without:
- Proper sim-to-real validation,
- Safety bounds/clamping and rate limiters,
- Task-level monitors and e-stop fallbacks.
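A minimal example of bounds clamping plus a per-step rate limiter (illustrative only; choose limits appropriate for your hardware):

```python
import torch

def safe_step(action, prev_action, lo, hi, max_delta):
    """Clamp a decoded action to hard safety bounds, then limit how far
    it may move from the previously commanded action in one step."""
    bounded = torch.clamp(action, lo, hi)
    delta = torch.clamp(bounded - prev_action, -max_delta, max_delta)
    return prev_action + delta
```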
Citation
If you use RVQ-AT in your work, please cite:
@software{rdt2,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
Contact
- Issues & requests: open a GitHub issue or start a discussion on the model page on the Hub.
License
This repository and the released models are licensed under Apache-2.0. You may use, modify, and distribute them, provided you keep a copy of the original license and notices in your distributions and state any significant changes you make.