diffusers-internal-dev (diffusers-internal-dev)

Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping, to avoid recompilation when swapping in new LoRAs 🤯

We have tested our recipe with Flux.1-Dev on both an H100 and an RTX 4090, and we achieve at least a *2x speedup* on each GPU. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served, so we hope it will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast
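To make the hotswapping idea in item 4 more concrete, here is a minimal pure-Python sketch. It rests on an assumption not spelled out in this post: that one reason a new LoRA can trigger recompilation is that adapters of different ranks have differently shaped weights, and that zero-padding every adapter to a shared maximum rank keeps the shapes fixed without changing the result. The helpers `matmul` and `pad_lora_to_rank` are hypothetical names used for illustration only and are not part of Diffusers or PEFT.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pad_lora_to_rank(lora_a, lora_b, target_rank):
    """Zero-pad A (rank x in_features) and B (out_features x rank) to target_rank.

    Hypothetical helper: the extra rows of A and columns of B are zeros,
    so the product B @ A is unchanged.
    """
    rank = len(lora_a)
    if rank > target_rank:
        raise ValueError("target_rank must be at least the adapter's rank")
    in_features = len(lora_a[0])
    extra = target_rank - rank
    padded_a = [row[:] for row in lora_a] + [[0.0] * in_features for _ in range(extra)]
    padded_b = [row[:] + [0.0] * extra for row in lora_b]
    return padded_a, padded_b


# Two toy adapters for the same 2x3 layer, with ranks 1 and 2.
a1, b1 = [[1.0, 2.0, 3.0]], [[2.0], [-1.0]]
a2, b2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]

MAX_RANK = 4  # assumed upper bound on the rank of any adapter we will serve
pa1, pb1 = pad_lora_to_rank(a1, b1, MAX_RANK)
pa2, pb2 = pad_lora_to_rank(a2, b2, MAX_RANK)

# After padding, both adapters share the same weight shapes...
same_shapes = (len(pa1), len(pb1[0])) == (len(pa2), len(pb2[0]))
# ...and the weight update B @ A each one contributes is unchanged.
delta1_unchanged = matmul(pb1, pa1) == matmul(b1, a1)
delta2_unchanged = matmul(pb2, pa2) == matmul(b2, a2)
```

With the shapes held constant, a new adapter's values can be copied into the existing buffers in place, which is the property a compiled model needs in order to be reused. The linked post has the actual recipe and code.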

a-r-r-o-w

updated a model 8 days ago

diffusers-internal-dev/modular-wan-t2v

Updated 8 days ago

a-r-r-o-w

published a model 8 days ago

diffusers-internal-dev/modular-wan-t2v

Updated 8 days ago

dn6

updated a model 14 days ago

diffusers-internal-dev/florence2-image-annotator-no-repo-info

Updated 14 days ago • 37

dn6

published a model 14 days ago

diffusers-internal-dev/florence2-image-annotator-no-repo-info

Updated 14 days ago • 37

dn6

updated a model 21 days ago

diffusers-internal-dev/cosmos-predict2-text-2-image

Updated 21 days ago

dn6

published a model 21 days ago

diffusers-internal-dev/cosmos-predict2-text-2-image

Updated 21 days ago

a-r-r-o-w

posted an update 22 days ago

Caching is an essential technique used in diffusion inference serving for speeding up image/video generations. Diffusers just added support for another caching method: First Block Cache - a technique developed by @chengzeyi building upon the ideas of TeaCache.

In short: if the model's predictions vary little across successive inference steps, we can skip the steps where the change is small. To decide whether an inference step will meaningfully change the overall velocity/noise prediction, we compute the relative difference between the output of the first transformer block at timestep $t$ and at timestep $t-1$, and compare it against a chosen threshold. If the difference is below the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model's predictions, so the threshold needs to be tested and selected for each model according to the quality-speed tradeoff you are willing to accept.
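The decision rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Diffusers implementation: first-block outputs are stood in for by short lists of floats, the relative difference is taken as the summed absolute change divided by the summed magnitude of the previous output, and each step is compared against the immediately preceding one. The names `relative_difference` and `plan_steps` are hypothetical.

```python
import math


def relative_difference(current, previous):
    """Summed absolute change, relative to the previous output's summed magnitude."""
    change = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(p) for p in previous)
    return change / scale if scale > 0 else math.inf


def plan_steps(first_block_outputs, threshold):
    """Return "compute" or "skip" for each step, comparing step t with step t-1."""
    decisions = []
    previous = None
    for output in first_block_outputs:
        if previous is not None and relative_difference(output, previous) < threshold:
            decisions.append("skip")
        else:
            decisions.append("compute")
        previous = output
    return decisions


# Toy first-block outputs for five successive steps (made-up numbers).
outputs = [[1.0, 2.0], [1.5, 2.5], [1.52, 2.51], [1.53, 2.52], [2.5, 3.5]]

low = plan_steps(outputs, threshold=0.2)
high = plan_steps(outputs, threshold=0.5)
print(low)   # ['compute', 'compute', 'skip', 'skip', 'compute']
print(high)  # ['compute', 'skip', 'skip', 'skip', 'skip']
```

On this toy data the higher threshold skips four steps instead of two, which mirrors the tradeoff described above: more speed, but more risk of drifting from the full computation.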

Diffusers usage with CogView4:

import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Enable First Block Cache on the transformer; a higher threshold skips more steps.
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
# Seed the generator so the output is reproducible.
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache

a-r-r-o-w

posted an update about 1 month ago

As you might have already heard, FLUX.1-Kontext-dev has been released and has taken the generative community by storm!

In case you haven't come across it, you can get started with Kontext using 🤗 diffusers. See the official [model](black-forest-labs/FLUX.1-Kontext-dev) and [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#flux).

Want to know how inference companies like Fal & Replicate are able to run the model in under 2 seconds per image? Check out this [gist](https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6) for some details!

multimodalart

posted an update about 1 month ago

Self-Forcing - a real-time video model distilled from Wan 2.1 by @adobe - is out, and they open-sourced it 🐐

I've built a live real-time demo on Spaces 📹💨

multimodalart/self-forcing

dn6

updated a model about 1 month ago

diffusers-internal-dev/chroma-test

Text-to-Image • Updated Jun 17 • 2

dn6

published a model about 1 month ago

diffusers-internal-dev/chroma-test

Text-to-Image • Updated Jun 17 • 2

a-r-r-o-w

posted an update about 2 months ago

New diffusion model for text-to-image and video-to-world generation: Cosmos Predict-2 👽

Model collection: nvidia/cosmos-predict2-68028efc052239369a0f2959
Diffusers support: https://github.com/huggingface/diffusers/pull/11695
Documentation: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos

These are results with the 2B param model. Imagine what you could do with the 14B version! Go check it out now!

a-r-r-o-w

posted an update about 2 months ago

Did you know how simple it is to get started with your own custom compiler backend for torch.compile? What's stopping you from writing your own compiler?

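import torch
from torch._functorch.partitioners import draw_graph

def compiler(fx_module: torch.fx.GraphModule, _):
    # Custom backend: dump the captured FX graph to a .dot file (relies on
    # pydot being installed), then return the unmodified forward to run it as is.
    draw_graph(fx_module, "compile.dot")
    return fx_module.forward

def capture(model, *inputs):
    # torch.compile invokes `compiler` lazily, on the first forward pass.
    compiled_model = torch.compile(model, backend=compiler)
    y = compiled_model(*inputs)
    y.sum().backward()

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.linear_1 = torch.nn.Linear(16, 32)
        self.linear_2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        x = self.linear_1(x)
        x = torch.nn.functional.silu(x)
        x = self.linear_2(x)
        return x

if __name__ == '__main__':
    # Use Apple's MPS device when available; otherwise fall back to the CPU.
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = MLP().to(device)
    x = torch.randn(4, 16, device=device, dtype=torch.float32)

    capture(model, x)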

--------------

Part of https://huggingface.co/posts/a-r-r-o-w/231008365980283

diffusers-internal-dev

AI & ML interests

Recent Activity

diffusers-internal-dev/modular-flux.1-dev

diffusers-internal-dev/Modular-Wan-I2V-14B-720P-Diffusers

diffusers-internal-dev/Modular-Wan-I2V-14B-480P-Diffusers

diffusers-internal-dev/modular-wan-t2v

diffusers-internal-dev/florence2-image-annotator-no-repo-info

diffusers-internal-dev/cosmos-predict2-text-2-image

diffusers-internal-dev/chroma-test

Team members 7

diffusers-internal-dev's activity