GGUF

The GGUF file format is typically used to store models for inference with GGML and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints that have been prequantized and saved in the GGUF format via from_single_file() on model classes. Loading GGUF checkpoints via pipelines is currently not supported.

The following example loads the FLUX.1 DEV transformer model using the GGUF Q2_K quantization variant.

Before starting, install gguf in your environment:

pip install -U gguf

Since GGUF is a single-file format, use FromSingleFileMixin.from_single_file() to load the model and pass in a GGUFQuantizationConfig.

When using GGUF checkpoints, the quantized weights remain in a low-memory dtype (typically torch.uint8) and are dynamically dequantized and cast to the configured compute_dtype during each module's forward pass through the model. The GGUFQuantizationConfig allows you to set the compute_dtype.

The functions used for dynamic dequantization are based on the great work done by city96, who created the PyTorch ports of the original NumPy implementation by compilade.

import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
# Load the prequantized GGUF transformer; the weights stay quantized and are
# dequantized to the configured compute_dtype during the forward pass.
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# Plug the quantized transformer into the full pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
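
As noted above, the quantized weights stay in their low-memory storage dtype until they are dequantized during the forward pass. The following optional check is a minimal sketch to confirm this after loading the transformer; which parameters are quantized depends on the checkpoint.

for name, param in list(transformer.named_parameters())[:5]:
    # Quantized weights typically report torch.uint8, while unquantized
    # parameters (e.g. biases and norm weights) keep the configured torch_dtype.
    print(name, param.dtype)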

Using Optimized CUDA Kernels with GGUF

Optimized CUDA kernels can accelerate inference with GGUF quantized models by approximately 10%. This functionality requires a compatible GPU with a compute capability greater than 7 (as reported by torch.cuda.get_device_capability) and the kernels library:

pip install -U kernels

Once installed, set DIFFUSERS_GGUF_CUDA_KERNELS=true to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable DIFFUSERS_GGUF_CUDA_KERNELS=false.
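
For example, you can enable the kernels and check your GPU's compute capability from Python before loading the model. This is a minimal sketch; alternatively, export the variable in your shell.

import os

# Opt in to the optimized GGUF CUDA kernels (set this before loading the model).
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"

import torch

# The optimized kernels require a compute capability greater than 7.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")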

Supported Quantization Types

  • BF16
  • Q4_0
  • Q4_1
  • Q5_0
  • Q5_1
  • Q8_0
  • Q2_K
  • Q3_K
  • Q4_K
  • Q5_K
  • Q6_K
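
Loading any of these variants only changes the checkpoint file passed to from_single_file(). Below is a minimal sketch, assuming the city96/FLUX.1-dev-gguf repository also hosts a Q8_0 file named flux1-dev-Q8_0.gguf; check the repository for the exact filename.

import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Hypothetical filename for the Q8_0 variant; verify it exists in the repository.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)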