Technical Article

SOTA Optimization for Image-Generation with Pruna Open Source

Jun 24, 2025

Nils Fleischmann

ML Research Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

Since releasing our open-source library in March, we’ve shipped a lot. Amongst other upgrades, we introduced three caching algorithms, a new quantizer, and QKV fusing. With these recent additions, Pruna sets new standards for compressing image-generation models in the open-source community, making image generation significantly faster while preserving high image quality and high fidelity to the base model. In this blog post, you will learn about the optimization algorithms that enable these speedups and how to use them correctly.

To see what’s possible, we’ll focus on FLUX.1-dev. The chart from Artificial Analysis shows the fastest FLUX.1-dev endpoints available today. As noted in an earlier post, most endpoints already employ optimization techniques under the hood. Running FLUX.1-dev with default settings on an L40S GPU puts us in last place - by a wide margin. Let’s change that.

One Compression Algorithm is Good, Multiple Compression Algorithms are Great!

At Pruna, we believe you get the best results by combining different optimization algorithms. While a single technique might not speed up your model by 4×, two algorithms each delivering a 2× boost can be combined to achieve that same 4× improvement.

There are four main algorithm groups for optimizing 🤗 Diffusers pipelines: quantizers, compilers, cachers, and factorizers. The overview below shows which algorithms are currently supported in each category.
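To make this concrete, the sketch below shows how each of the four groups maps onto a key in a SmashConfig. The algorithm names are the ones used later in this post; which combinations work best depends on the model, as the rest of the post shows.

from pruna import SmashConfig

# One configuration key per algorithm group (illustrative mapping only).
smash_config = SmashConfig()
smash_config["cacher"] = "fora"  # cachers: reuse backbone outputs across steps
smash_config["compiler"] = "torch_compile"  # compilers: fuse and optimize operations
smash_config["quantizer"] = "torchao"  # quantizers: lower numerical precision
smash_config["factorizer"] = "qkv_diffusers"  # factorizers: e.g. QKV fusing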

Cachers

As shown in the figure on the left, diffusion models generate images by starting with pure noise and gradually removing it over multiple inference steps until the final image emerges. At each step, a blurry image is fed into a neural network backbone (for example, a transformer), which predicts the noise that should be subtracted from the image.

Recent papers have shown that consecutive backbone passes share many similarities. In particular, the outputs of expensive operations within the backbone tend to remain almost the same from one step to the next. This finding motivates the use of caching: if these outputs only differ slightly, we can compute them once and reuse them in subsequent steps. The figure on the right illustrates the FORA caching algorithm, where the costly backbone operations are performed every n steps (n = 2 in the example), and their results are reused in the intervening steps.
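To make the idea concrete, here is a simplified sketch of interval-based caching in Python. The helpers denoise_backbone and update_latents are hypothetical stand-ins for the expensive backbone pass and the cheap per-step update; this is not Pruna's actual implementation.

# Simplified illustration of interval-based caching (hypothetical helper functions).
# The expensive backbone pass runs every `interval` steps; its output is reused
# on the steps in between.
cached_output = None
for step, t in enumerate(timesteps):
    if step % interval == 0:
        cached_output = denoise_backbone(latents, t)  # full, expensive computation
    # cheap update that reuses the cached backbone output on intervening steps
    latents = update_latents(latents, cached_output, t)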

Each caching algorithm provides an interval hyperparameter to tune caching aggressiveness. An interval of 3, for example, runs the backbone every third step and reuses the cached outputs for the two intervening steps.

For FLUX, you can choose from three cachers: FORA, PAB, and FasterCache.

import torch
from diffusers import FluxPipeline

from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

cacher = "fora"  # or "pab" or "fastercache"
interval = 2  # or 3, 4

smash_config = SmashConfig()
smash_config["cacher"] = cacher
smash_config[f"{cacher}_interval"] = interval
smashed_pipe = smash(pipe, smash_config)

smashed_pipe("a knitted purple prune").images[0]

To understand how different cacher-interval configurations impact our model, we measured average inference time, ARNIQA, and LPIPS on the DrawBench prompt dataset:

The chart shows that the FORA cacher achieves the largest speedup while scoring highest on the ARNIQA metric. If you care about very high fidelity to the base model, FasterCache might be the right choice for you. We can also see the trade-off controlled by the interval parameter: as the interval increases, inference becomes faster, but at the cost of quality. In fact, increasing the interval from 3 to 4 causes a drop in quality across all cachers.
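If you want to reproduce this kind of fidelity measurement yourself, one option is the lpips package; the sketch below assumes base_image and smashed_image are PIL outputs of the base and smashed pipelines for the same prompt and seed, and the post's exact evaluation setup may differ.

import lpips
import numpy as np
import torch

# LPIPS distance between a base-model image and a smashed-model image
# (lower means closer to the base model).
loss_fn = lpips.LPIPS(net="alex")

def to_tensor(img):
    # HWC uint8 PIL image -> NCHW float tensor in [-1, 1], as expected by lpips
    arr = np.asarray(img).astype(np.float32) / 255.0
    return torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0) * 2 - 1

distance = loss_fn(to_tensor(base_image), to_tensor(smashed_image))
print(f"LPIPS: {distance.item():.4f}")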

In the example below, you can see how the image changes for FORA with different interval settings. Most differences are subtle, like the shape of the cheeks, but with more aggressive cache settings, the photos can also lose some sharpness.

Compilers

While caching reduces the number of times expensive operations in the backbone are computed, another way to obtain a speedup is to optimize these operations. Compilers analyze the computations in the backbone and determine which operations can be fused to make computations more efficient. The only drawback is that the first execution takes longer.

compiler = "torch_compile"  # or "stable_fast"

smash_config = SmashConfig()
smash_config["cacher"] = "fora"
smash_config["fora_interval"] = 2  # 3, 4
smash_config["compiler"] = compiler
smashed_pipe = smash(pipe, smash_config
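Since the first call triggers compilation, it pays to warm up the pipeline once before timing or serving it; for example:

# The first call after smashing compiles the backbone and is therefore slow;
# subsequent calls reuse the compiled graph.
_ = smashed_pipe("a knitted purple prune").images[0]  # warm-up (slow, compiles)
image = smashed_pipe("a knitted purple prune").images[0]  # fast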

As the plots below show, compilation has virtually no impact on output quality. For FLUX, torch.compile achieves greater speedups than Stable Fast.

Quantizers

Quantization lowers the precision of the numbers used to represent a model’s parameters and calculations. Employing lower bit widths (for example, converting 16-bit floating-point values to 8-bit integers) reduces model size and speeds up inference.
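As a back-of-the-envelope illustration of the idea (a deliberately simplified symmetric scheme with one scale per tensor, not what torchao does internally), quantizing a weight tensor to 8-bit integers looks roughly like this:

import torch

# Simplified symmetric int8 quantization of a weight tensor (illustrative only).
w = torch.randn(4, 4, dtype=torch.bfloat16)
scale = w.abs().max().float() / 127.0  # map the largest magnitude to 127
w_int8 = (w.float() / scale).round().clamp(-127, 127).to(torch.int8)
w_dequant = w_int8.float() * scale  # approximate reconstruction
print((w.float() - w_dequant).abs().max())  # small quantization error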

Given its compute-intensive attention mechanism, FLUX benefits most from dynamic quantization, which quantizes both weights and activations. For instance, torchao's dynamic quantization can yield an additional speedup on top of torch.compile. Because modules such as normalization layers can be sensitive to dynamic quantization, we make it easy to exclude them. To get speedups with torchao, we have to use torch.compile's "max-autotune-no-cudagraphs" mode, which increases the cold-start time compared to the default compile mode.

smash_config = SmashConfig()
smash_config["cacher"] = "fora"
smash_config["fora_interval"] = 2  # 3, 4
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs"
smash_config["quantizer"] = "torchao"
smash_config["torchao_quant_type"] = "int8dq"
smash_config["torchao_excluded_modules"] = "norm+embedding"  # or "none"
smashed_pipe = smash(pipe, smash_config)

As seen in the plots below, applying dynamic quantization barely affects quality while giving an additional speedup, provided the sensitive modules are excluded correctly. Furthermore, it cuts peak GPU memory usage from 34.7 GB to 28.0 GB. Note that we only quantize the transformer, not the VAE or text encoder, which explains why the overall memory savings are limited.
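If you want to check the memory footprint on your own setup, you can record the peak allocation around a single generation, for example:

import torch

# Measure peak GPU memory for one generation (numbers vary with GPU, prompt, and resolution).
torch.cuda.reset_peak_memory_stats()
_ = smashed_pipe("a knitted purple prune").images[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")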

In the images below (interval=2), naive quantization produces a slightly less sharp image and alters subtle details, such as the cheeks and mouth. In contrast, filtered quantization preserves sharpness and leaves the image virtually unchanged.

How to Try Open-Source Compression Algorithms?

Great, we figured out that the following combination of algorithms works well for FLUX.1-dev. In case you are curious and want to try it out on your own, just follow these steps:

Pruna is available on PyPI, so you can install it using pip:

pip install pruna

Once you have installed Pruna, you can run the snippet below. Feel free to play with the SmashConfig to better understand the different algorithms.

import torch
from diffusers import FluxPipeline

from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

smash_config = SmashConfig()
smash_config["cacher"] = "fora"
smash_config["fora_interval"] = 3  # or 2 for even faster inference
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs"
smash_config["quantizer"] = "torchao"
smash_config["torchao_quant_type"] = "int8dq" # you can also try fp8dq
smash_config["torchao_excluded_modules"] = "norm+embedding"
smashed_pipe = smash(pipe, smash_config)

smashed_pipe("a knitted purple prune").images[0]

This configuration makes the model 4.2x faster on an L40S GPU. However, you might observe a different speedup if you run this on your GPU.
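To measure the speedup on your own hardware, you can time a few generations after a warm-up call (the first call is slow because of compilation); a minimal sketch:

import time
import torch

# Rough latency measurement: warm up once (triggers compilation), then average a few runs.
_ = smashed_pipe("a knitted purple prune").images[0]

torch.cuda.synchronize()
start = time.perf_counter()
n_runs = 3
for _ in range(n_runs):
    _ = smashed_pipe("a knitted purple prune").images[0]
torch.cuda.synchronize()
print(f"Average latency: {(time.perf_counter() - start) / n_runs:.2f}s")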

How to Compress Other Text-to-Image Models?

This kind of optimization applies not only to FLUX.1-dev but also to other image-generation models. Take, for example, Stable Diffusion models, which have gained considerable popularity in recent years. Because they use a different architecture, you may want to employ a different combination of algorithms to optimize them. We have found that combining DeepCache, Stable Fast, and QKV fusing works particularly well with Stable Diffusion models.

As the example below shows, this combination accelerates Stable Diffusion XL by 3.6× while keeping the image virtually unchanged.

You can try out this combination by running the following snippet:

from diffusers import DiffusionPipeline
import torch

from pruna import SmashConfig, smash

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["deepcache_interval"] = 3  # or 2
smash_config["compiler"] = "stable_fast"
smash_config["factorizer"] = "qkv_diffusers"
smashed_pipe = smash(pipe, smash_config)

smashed_pipe("A beautiful castle beside a waterfall in the woods, by Josef Thoma, matte painting, trending on artstation HQ").images[0]

Moving Forward

Looking back at the chart from Artificial Analysis: while other providers rely on closed-source optimizations, Pruna's open-source library took us from last place up to second, cutting inference latency to 3.8s. We managed that on an L40S GPU, while most leaderboard entries rely on one (or multiple) faster H100s. Do you want to reach first place with even more compression algorithms? Check out our Flux-juiced blog post and the inference benchmark.

We are very proud that these state-of-the-art results are possible with our open-source version - and we’re just getting started. In the coming months, we plan to add new algorithms that might unlock even more powerful combinations. If you’re curious, check out our repository to be among the first to try the latest techniques. Stay tuned!

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.