---
library_name: pytorch
license: mit
pipeline_tag: image-feature-extraction
tags:
- computer-vision
- image-generation
- tokenizer
- autoencoder
- denoising
- visual-tokenizer
- imagenet
- generative-modeling
---

# DeTok: Latent Denoising Makes Good Visual Tokenizers

[![arXiv](https://img.shields.io/badge/arXiv-2507.15856-b31b1b.svg)](https://arxiv.org/abs/2507.15856)
[![GitHub](https://img.shields.io/badge/GitHub-DeTok-blue)](https://github.com/Jiawei-Yang/DeTok)

## Model Description

**l-DeTok** (Latent Denoising Tokenizer) is a simple approach for training visual tokenizers by incorporating denoising objectives during tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle benefits downstream generation tasks.

### Approach

Modern generative models typically involve **reconstructing clean signals from corrupted inputs** (a form of denoising). This work investigates whether training tokenizers to reconstruct clean images from corrupted latent embeddings produces representations that are better suited for downstream generative modeling.

### Architecture

- **Encoder-Decoder Architecture**: Based on Vision Transformers (ViT)
- **Denoising Strategies**:
  - **Interpolative Latent Noise**: Corrupts latent embeddings through noise interpolation
  - **Random Masking**: Masks random subsets of image patches during training
- **Training Losses**: Same as conventional image tokenizers

## Model Variants

| Model | Type | Parameters | Description |
|-------|------|------------|-------------|
| **DeTok-BB** | Tokenizer | 172M | Base tokenizer with denoising training |
| **DeTok-BB-decoder_ft** | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |

## Results

We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements.

### With MAR Models (FID-50K with CFG)

| Model | FID-50K | Inception Score | Parameters |
|-------|---------|-----------------|------------|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | **1.61** | **289.7** | 208M |
| MAR-Base + DeTok-BB-decoder_ft | **1.55** | **291.0** | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | **1.43** | **303.5** | 479M |
| MAR-Large + DeTok-BB-decoder_ft | **1.32** | **304.1** | 479M |

### Observations

- **MAR-B**: FID improves from 2.31 (MAR-VAE) → 1.55 (ours)
- **MAR-L**: FID improves from 1.78 (MAR-VAE) → 1.32 (ours)
- The approach works across both non-autoregressive (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)

## Usage

### Installation

To use DeTok for extracting latent embeddings from images:

1. **Clone the official DeTok repository**:

```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
pip install -r requirements.txt
```

2. **Download the pre-trained tokenizer weights**: You can download the `DeTok-BB-decoder_ft` checkpoint (recommended) from [here](https://huggingface.co/jjiaweiyang/l-DeTok/resolve/main/detok-BB-gamm3.0-m0.7-decoder_tuned.pth) and place it in your working directory (e.g., `detok-BB-gamm3.0-m0.7-decoder_tuned.pth`).
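Alternatively, you can fetch the checkpoint programmatically. The sketch below is an optional convenience, assuming the `huggingface_hub` package is installed; the `repo_id` and `filename` are taken from the download link above.

```python
# Optional: download the checkpoint programmatically (assumes `pip install huggingface_hub`).
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
print(ckpt_path)  # local path to the cached checkpoint file
```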
### Extract latent embeddings

Here's a sample Python code snippet for feature extraction using the `DeTok_BB` tokenizer:

```python
import torch
from PIL import Image
from torchvision import transforms

from models.detok import DeTok_BB  # Import from the cloned DeTok repository

# --- Configuration (matching the DeTok-BB-decoder_ft architecture from the paper) ---
model_params = {
    "img_size": 256,
    "patch_size": 16,
    "in_chans": 3,
    "embed_dim": 768,
    "depths": [2, 2, 8, 2],
    "num_heads": [3, 6, 12, 24],
}
tokenizer_weights_path = "detok-BB-gamm3.0-m0.7-decoder_tuned.pth"  # Path to your downloaded weights

# 1. Initialize and load the tokenizer
tokenizer = DeTok_BB(**model_params).eval()
if torch.cuda.is_available():
    tokenizer = tokenizer.cuda()

# Load checkpoint state_dict
checkpoint = torch.load(tokenizer_weights_path, map_location="cpu")
tokenizer.load_state_dict(checkpoint["model"])

# 2. Prepare your image
transform = transforms.Compose([
    transforms.Resize(model_params["img_size"]),
    transforms.CenterCrop(model_params["img_size"]),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Replace 'path/to/your/image.jpg' with your actual image file
image = Image.new("RGB", (model_params["img_size"], model_params["img_size"]), color="red")  # Example dummy image
# image = Image.open("path/to/your/image.jpg").convert("RGB")

pixel_values = transform(image).unsqueeze(0)  # Add batch dimension
if torch.cuda.is_available():
    pixel_values = pixel_values.cuda()

# 3. Extract latent embeddings
with torch.no_grad():
    latent_embeddings = tokenizer.encode(pixel_values)

print(f"Shape of latent embeddings: {latent_embeddings.shape}")
# Expected output for a 256x256 input image with 16x16 patches is (1, 256, 768),
# representing 256 image patches with 768-dimensional embeddings.
```

## Training Details

### Tokenizer Training

- **Dataset**: ImageNet train set
- **Resolution**: 256×256
- **Batch Size**: 1024 (global)
- **Epochs**: 200 (base) + 100 (decoder fine-tuning)
- **Denoising Parameters**:
  - Gamma (noise strength): 3.0
  - Maximum mask ratio M: 0.7
  - Masking ratio per batch: max(0, u) with u ~ U(-0.1, M)

### Key Training Components

- **Interpolative Noise**: x' = (1 − τ)·x + τ·ε(γ), where τ ~ U(0, 1) and ε(γ) is noise with strength γ
- **Random Masking**: Variable masking ratios during training
- **Multi-component Loss**: MSE + KL + Perceptual + Adversarial losses

## Technical Details

### Denoising Methodology

1. **Interpolative Latent Noise**: We use interpolative noise rather than additive noise, which allows for heavier corruption when the noise level τ is high.
2. **Masking as Deconstruction**: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders (both strategies are sketched below).
3. **Downstream Alignment**: The denoising objective is designed to align with how modern generative models operate.
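The following sketch illustrates how the two corruption strategies could be combined during tokenizer training, written directly against the formulas above (γ = 3.0, M = 0.7). It is a simplified illustration, not the official DeTok training code; in particular, the handling of masked tokens (simply zeroed here) is an assumption.

```python
import torch


def corrupt_latents(z: torch.Tensor, gamma: float = 3.0, max_mask_ratio: float = 0.7) -> torch.Tensor:
    """Illustrative corruption of latent embeddings z of shape (B, N, D).

    Combines interpolative latent noise and random masking as described above.
    This is a sketch under stated assumptions, not the official implementation.
    """
    B, N, _ = z.shape

    # Interpolative latent noise: x' = (1 - tau) * x + tau * eps(gamma),
    # with tau ~ U(0, 1) per sample and eps assumed Gaussian with std gamma.
    tau = torch.rand(B, 1, 1, device=z.device)
    eps = torch.randn_like(z) * gamma
    z_noised = (1.0 - tau) * z + tau * eps

    # Random masking: the ratio is sampled as max(0, u) with u ~ U(-0.1, M),
    # so some batches see no masking at all.
    mask_ratio = max(0.0, float(torch.empty(1).uniform_(-0.1, max_mask_ratio)))
    mask = torch.rand(B, N, device=z.device) < mask_ratio  # True = masked token

    # Assumption: masked tokens are zeroed here for illustration; the official
    # code may instead handle them with a dedicated mask token before decoding.
    return z_noised.masked_fill(mask.unsqueeze(-1), 0.0)
```

During training, the decoder would then reconstruct the clean image from `corrupt_latents(z)` under the reconstruction, KL, perceptual, and adversarial losses listed under Key Training Components.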
### Potential Benefits

- **Task Alignment**: The training objective is designed to match downstream generative model objectives
- **Simplicity**: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
- **Generality**: We observe improvements across different types of generative models
- **Robustness**: The learned representations appear to remain useful even under corruption

## Evaluation

### Datasets

- **Training**: ImageNet train set
- **Evaluation**: ImageNet validation set (50k images)

### Metrics

- **FID-50K**: Fréchet Inception Distance on 50,000 generated samples
- **Inception Score**: Standard generative model evaluation metric
- **Precision & Recall**: Using ImageNet validation precision-recall data

## Citation

```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```

## License

This project is licensed under the MIT License.

## Acknowledgments

This work builds upon many excellent open-source projects. We are particularly grateful to:

- [MAR](https://github.com/LTH14/mar) for masked autoregressive modeling
- [DiT](https://github.com/facebookresearch/DiT) for diffusion transformers
- [MAE](https://github.com/facebookresearch/mae) for masked autoencoder insights
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer) for tokenizer implementations
- The broader research community for foundational work in generative modeling

## Contact

For questions or issues, please open a GitHub issue at the [official repository](https://github.com/Jiawei-Yang/DeTok).