---
library_name: pytorch
license: mit
pipeline_tag: image-feature-extraction
tags:
- computer-vision
- image-generation
- tokenizer
- autoencoder
- denoising
- visual-tokenizer
- imagenet
- generative-modeling
---

# DeTok: Latent Denoising Makes Good Visual Tokenizers

[![arXiv](https://img.shields.io/badge/arXiv-2507.15856-b31b1b.svg)](https://arxiv.org/abs/2507.15856)
[![GitHub](https://img.shields.io/badge/GitHub-DeTok-blue)](https://github.com/Jiawei-Yang/DeTok)

## Model Description

**l-DeTok** (Latent Denoising Tokenizer) is a simple approach for training visual tokenizers by incorporating denoising objectives during tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle benefits downstream generation tasks.

### Approach

Modern generative models typically involve **reconstructing clean signals from corrupted inputs** (a form of denoising). This work investigates whether training tokenizers to reconstruct clean images from corrupted latent embeddings produces representations that are better suited for downstream generative modeling.

### Architecture

- **Encoder-Decoder Architecture**: Based on Vision Transformers (ViT)
- **Denoising Strategies**:
  - **Interpolative Latent Noise**: Corrupts latent embeddings through noise interpolation
  - **Random Masking**: Masks random subsets of image patches during training
- **Training Losses**: Same as conventional image tokenizers

## Model Variants

| Model | Type | Parameters | Description |
|-------|------|------------|-------------|
| **DeTok-BB** | Tokenizer | 172M | Base tokenizer with denoising training |
| **DeTok-BB-decoder_ft** | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |

## Results

We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements.

### With MAR Models (FID-50K with CFG)

| Model | FID-50K | Inception Score | Parameters |
|-------|---------|-----------------|------------|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | **1.61** | **289.7** | 208M |
| MAR-Base + DeTok-BB-decoder_ft | **1.55** | **291.0** | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | **1.43** | **303.5** | 479M |
| MAR-Large + DeTok-BB-decoder_ft | **1.32** | **304.1** | 479M |

### Observations

- **MAR-B**: FID improves from 2.31 (MAR-VAE) → 1.55 (ours)
- **MAR-L**: FID improves from 1.78 (MAR-VAE) → 1.32 (ours)
- The approach works across both non-autoregressive (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)

## Usage

### Installation

To use DeTok for extracting latent embeddings from images:

1. **Clone the official DeTok repository**:

```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
pip install -r requirements.txt
```

2. **Download the pre-trained tokenizer weights**: You can download the `DeTok-BB-decoder_ft` checkpoint (recommended) from [here](https://huggingface.co/jjiaweiyang/l-DeTok/resolve/main/detok-BB-gamm3.0-m0.7-decoder_tuned.pth) and place it in your working directory (e.g., `detok-BB-gamm3.0-m0.7-decoder_tuned.pth`).
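Alternatively, you can fetch the checkpoint programmatically. The sketch below is an optional convenience, assuming the `huggingface_hub` package is installed; the `repo_id` and `filename` are taken from the download link above.

```python
# Optional: download the checkpoint programmatically (assumes `pip install huggingface_hub`).
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
print(ckpt_path)  # local path to the cached checkpoint file
```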
### Extract latent embeddings

Here's a sample Python code snippet for feature extraction using the `DeTok_BB` tokenizer:

```python
import torch
from PIL import Image
from torchvision import transforms

from models.detok import DeTok_BB  # Import from the cloned DeTok repository

# --- Configuration (matching the DeTok-BB-decoder_ft architecture from the paper) ---
model_params = {
    "img_size": 256,
    "patch_size": 16,
    "in_chans": 3,
    "embed_dim": 768,
    "depths": [2, 2, 8, 2],
    "num_heads": [3, 6, 12, 24],
}
tokenizer_weights_path = "detok-BB-gamm3.0-m0.7-decoder_tuned.pth"  # Path to your downloaded weights

# 1. Initialize and load the tokenizer
tokenizer = DeTok_BB(**model_params).eval()
if torch.cuda.is_available():
    tokenizer = tokenizer.cuda()

# Load checkpoint state_dict
checkpoint = torch.load(tokenizer_weights_path, map_location="cpu")
tokenizer.load_state_dict(checkpoint["model"])

# 2. Prepare your image
transform = transforms.Compose([
    transforms.Resize(model_params["img_size"]),
    transforms.CenterCrop(model_params["img_size"]),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Replace 'path/to/your/image.jpg' with your actual image file
image = Image.new("RGB", (model_params["img_size"], model_params["img_size"]), color="red")  # Example dummy image
# image = Image.open("path/to/your/image.jpg").convert("RGB")

pixel_values = transform(image).unsqueeze(0)  # Add batch dimension
if torch.cuda.is_available():
    pixel_values = pixel_values.cuda()

# 3. Extract latent embeddings
with torch.no_grad():
    latent_embeddings = tokenizer.encode(pixel_values)

print(f"Shape of latent embeddings: {latent_embeddings.shape}")
# Expected output for a 256x256 input image with 16x16 patches is (1, 256, 768),
# representing 256 image patches with 768-dimensional embeddings.
```

## Training Details

### Tokenizer Training

- **Dataset**: ImageNet train set
- **Resolution**: 256×256
- **Batch Size**: 1024 (global)
- **Epochs**: 200 (base) + 100 (decoder fine-tuning)
- **Denoising Parameters**:
  - Gamma (noise strength): 3.0
  - Maximum mask ratio M: 0.7
  - Masking ratio per batch: max(0, u) with u ~ U(-0.1, M)

### Key Training Components

- **Interpolative Noise**: x' = (1 − τ)·x + τ·ε(γ), where τ ~ U(0, 1) and ε(γ) is noise with strength γ
- **Random Masking**: Variable masking ratios during training
- **Multi-component Loss**: MSE + KL + Perceptual + Adversarial losses

## Technical Details

### Denoising Methodology

1. **Interpolative Latent Noise**: We use interpolative noise rather than additive noise, which allows for heavier corruption when the noise level τ is high.
2. **Masking as Deconstruction**: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders (both strategies are sketched below).
3. **Downstream Alignment**: The denoising objective is designed to align with how modern generative models operate.
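The following sketch illustrates how the two corruption strategies could be combined during tokenizer training, written directly against the formulas above (γ = 3.0, M = 0.7). It is a simplified illustration, not the official DeTok training code; in particular, the handling of masked tokens (simply zeroed here) is an assumption.

```python
import torch


def corrupt_latents(z: torch.Tensor, gamma: float = 3.0, max_mask_ratio: float = 0.7) -> torch.Tensor:
    """Illustrative corruption of latent embeddings z of shape (B, N, D).

    Combines interpolative latent noise and random masking as described above.
    This is a sketch under stated assumptions, not the official implementation.
    """
    B, N, _ = z.shape

    # Interpolative latent noise: x' = (1 - tau) * x + tau * eps(gamma),
    # with tau ~ U(0, 1) per sample and eps assumed Gaussian with std gamma.
    tau = torch.rand(B, 1, 1, device=z.device)
    eps = torch.randn_like(z) * gamma
    z_noised = (1.0 - tau) * z + tau * eps

    # Random masking: the ratio is sampled as max(0, u) with u ~ U(-0.1, M),
    # so some batches see no masking at all.
    mask_ratio = max(0.0, float(torch.empty(1).uniform_(-0.1, max_mask_ratio)))
    mask = torch.rand(B, N, device=z.device) < mask_ratio  # True = masked token

    # Assumption: masked tokens are zeroed here for illustration; the official
    # code may instead handle them with a dedicated mask token before decoding.
    return z_noised.masked_fill(mask.unsqueeze(-1), 0.0)
```

During training, the decoder would then reconstruct the clean image from `corrupt_latents(z)` under the reconstruction, KL, perceptual, and adversarial losses listed under Key Training Components.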
### Potential Benefits

- **Task Alignment**: The training objective is designed to match downstream generative model objectives
- **Simplicity**: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
- **Generality**: We observe improvements across different types of generative models
- **Robustness**: The learned representations appear to remain useful even under corruption

## Evaluation

### Datasets

- **Training**: ImageNet train set
- **Evaluation**: ImageNet validation set (50k images)

### Metrics

- **FID-50K**: Fréchet Inception Distance on 50,000 generated samples
- **Inception Score**: Standard generative model evaluation metric
- **Precision & Recall**: Using ImageNet validation precision-recall data

## Citation

```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```

## License

This project is licensed under the MIT License.

## Acknowledgments

This work builds upon many excellent open-source projects. We are particularly grateful to:

- [MAR](https://github.com/LTH14/mar) for masked autoregressive modeling
- [DiT](https://github.com/facebookresearch/DiT) for diffusion transformers
- [MAE](https://github.com/facebookresearch/mae) for masked autoencoder insights
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer) for tokenizer implementations
- The broader research community for foundational work in generative modeling

## Contact

For questions or issues, please open a GitHub issue at the [official repository](https://github.com/Jiawei-Yang/DeTok).