Technical Article
SOTA Optimization for Image-Generation with Pruna Open Source
Jun 24, 2025

Nils Fleischmann
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
Since releasing our open-source library in March, we’ve shipped a lot. Among other upgrades, we introduced three caching algorithms, a new quantizer, and QKV fusing. With these recent additions, Pruna sets new standards for compressing image-generation models in the open-source community, making:
FLUX.1-dev 4.2x faster
Stable Diffusion XL 3.6x faster
Stable Diffusion 3.5x faster
while preserving high image quality and high fidelity to the base model. In this blog post, you will learn about the optimization algorithms that enable these speedups and how to use them correctly.
To see what’s possible, we’ll focus on FLUX.1-dev. The chart from Artificial Analysis shows the fastest FLUX.1-dev endpoints available today. As noted in an earlier post, most endpoints already employ optimization techniques under the hood. Running FLUX.1-dev with default settings on an L40S GPU puts us in last place, by a wide margin. Let’s change that.

One Compression Algorithm is Good, Multiple Compression Algorithms are Great!
At Pruna, we believe you get the best results by combining different optimization algorithms. While a single technique might not speed up your model by 4×, two algorithms that each deliver a 2× boost can be combined to achieve that same 4× improvement.
There are four main algorithm groups for optimizing 🤗 Diffusers pipelines: quantizers, compilers, cachers, and factorizers. The overview below shows which algorithms are currently supported in each category.

Cachers
As shown in the figure on the left, diffusion models generate images by starting with pure noise and gradually removing it over multiple inference steps until the final image emerges. At each step, a blurry image is fed into a neural network backbone (for example, a transformer), which predicts the noise that should be subtracted from the image.
Recent papers have shown that consecutive backbone passes share many similarities. In particular, the outputs of expensive operations within the backbone tend to remain almost the same from one step to the next. This finding motivates the use of caching: if these outputs only differ slightly, we can compute them once and reuse them in subsequent steps. The figure on the right illustrates the FORA caching algorithm, where the costly backbone operations are performed every n steps (n = 2 in the example), and their results are reused in the intervening steps.

Each caching algorithm provides an interval hyperparameter to tune caching aggressiveness. An interval of 3, for example, runs the backbone every third step and reuses the cached outputs for the two intervening steps.
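To make the idea concrete, here is a minimal toy sketch of interval caching (not Pruna’s implementation): an expensive block is recomputed only every interval steps, and its cached output is reused in between.

```python
import torch

# Toy stand-in for the expensive part of a diffusion backbone.
expensive_block = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
)

def denoise_with_cache(x: torch.Tensor, num_steps: int = 12, interval: int = 3) -> torch.Tensor:
    """Toy denoising loop that recomputes the expensive block only every `interval` steps."""
    cached = None
    for step in range(num_steps):
        if step % interval == 0 or cached is None:
            cached = expensive_block(x)  # full, expensive computation
        # On the intervening steps we reuse `cached`; the cheap update still runs every step.
        x = x - 0.1 * cached
    return x

latent = torch.randn(1, 64)
print(denoise_with_cache(latent).shape)  # torch.Size([1, 64])
```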
For FLUX you can choose between three cachers: FORA, PAB, and FasterCache.
To understand how different cacher-interval configurations impact our model, we measured average inference time, ARNIQA, and LPIPS on the DrawBench prompt dataset:

The chart shows that the FORA cacher achieves the largest speedup while scoring highest on the ARNIQA metric. If you care about very high fidelity to the base model, FasterCache might be the right choice for you. We can also see the trade-off controlled by the interval parameter: as the interval increases, inference becomes faster, but at the cost of quality. In fact, increasing the interval from 3 to 4 causes a drop in quality across all cachers.
In the example below, you can see how the image changes for FORA with different interval settings. Most differences are subtle, like the shape of the cheeks, but with more aggressive cache settings, the photos can also lose some sharpness.

Compilers
While caching reduces how often the expensive operations in the backbone are computed, another way to obtain a speedup is to optimize those operations themselves. Compilers analyze the computation graph of the backbone and determine which operations can be fused to make execution more efficient. The only drawback is that the first execution takes longer.
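To illustrate what a compiler does in isolation, here is a minimal sketch that applies torch.compile directly to the FLUX transformer of a Diffusers pipeline, outside of Pruna; the prompt and the compile mode are just examples.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the backbone: the first call is slow (compilation), subsequent calls are faster.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

image = pipe("a cat wearing a spacesuit", num_inference_steps=28).images[0]
image.save("cat.png")
```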
As the plots below show, compilation has virtually no impact on output quality. For FLUX, torch.compile achieves greater speedups than Stable Fast.

Quantizers
Quantization lowers the precision of the numbers used to represent a model’s parameters and calculations. Employing lower bit widths (for example, converting 16-bit floating-point values to 8-bit integers) reduces model size and speeds up inference.
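As a simplified illustration of the idea, the snippet below applies symmetric, per-tensor int8 quantization to a single weight tensor; real quantizers are more elaborate.

```python
import torch

w = torch.randn(4, 4, dtype=torch.float16)               # 16-bit weights
scale = w.abs().max().float() / 127.0                     # one scale for the whole tensor
w_int8 = torch.clamp(torch.round(w.float() / scale), -127, 127).to(torch.int8)
w_dequant = (w_int8.float() * scale).to(torch.float16)    # approximate reconstruction

print((w - w_dequant).abs().max())                        # small quantization error
```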
Given its compute-intensive attention mechanism, FLUX benefits most from dynamic quantization, which quantizes both weights and activations. For instance, torchao's dynamic quantization can yield an additional speedup on top of torch.compile. Because modules such as normalization layers can be sensitive to dynamic quantization, we make it easy to exclude them. To get speedups with torchao, we have to combine it with torch.compile's "max-autotune-no-cudagraphs" mode, which increases the cold-start time compared to the default compile mode.
As seen in the plots below, applying dynamic quantization barely affects quality while giving a speedup, provided these sensitive modules are filtered out correctly. It also cuts peak GPU memory usage from 34.7 GB to 28.0 GB. Note that we only quantize the transformer, not the VAE or text encoder, which explains why the overall memory savings are limited.

In the images below (interval=2), naive quantization produces a slightly less sharp image and alters subtle details such as the cheeks and mouth. In contrast, filtered quantization preserves sharpness and leaves the image virtually unchanged.
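As a sketch of what filtered dynamic quantization can look like when using torchao directly, outside of Pruna (assuming a recent torchao version): the filter below simply quantizes linear layers and skips anything with "norm" in its name.

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

def skip_sensitive_modules(module: torch.nn.Module, fqn: str) -> bool:
    # Quantize only linear layers; leave (sensitive) normalization layers untouched.
    return isinstance(module, torch.nn.Linear) and "norm" not in fqn

# Dynamically quantize the transformer's weights and activations to int8.
quantize_(pipe.transformer, int8_dynamic_activation_int8_weight(), filter_fn=skip_sensitive_modules)

# Dynamic quantization pays off together with this compile mode (see above).
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```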

How to Try Open-Source Compression Algorithms?
Great, we figured out that the following combination of algorithms works well for FLUX.1-dev. In case you are curious and want to try it out on your own, just follow these steps:

Pruna is available on PyPI, so you can install it using pip:
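```bash
pip install pruna
```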
Once you have installed Pruna, you can run the snippet below. Feel free to play with the SmashConfig to better understand the different algorithms.
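A minimal sketch of such a snippet is shown below; the exact algorithm identifiers ("fora", "torchao", "torch_compile") and additional hyperparameters (such as the caching interval or the compile mode) are assumptions here, so check the Pruna documentation for the current names.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# Load the base pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Combine a cacher, a quantizer, and a compiler in one SmashConfig.
# NOTE: the algorithm names below are illustrative; see the Pruna docs for the exact identifiers.
smash_config = SmashConfig()
smash_config["cacher"] = "fora"
smash_config["quantizer"] = "torchao"
smash_config["compiler"] = "torch_compile"

# Apply all algorithms at once; the smashed model is called like the original pipeline.
smashed_pipe = smash(model=pipe, smash_config=smash_config)

image = smashed_pipe("a photo of a corgi surfing a wave", num_inference_steps=28).images[0]
image.save("corgi.png")
```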
This configuration makes the model 4.2x faster on an L40S GPU. However, you might observe a different speedup if you run this on your GPU.
How to Compress Other Text-to-Image Models?
This kind of optimization applies not only to FLUX.1-dev but also to other image-generation models. Take, for example, Stable Diffusion models, which have gained considerable popularity in recent years. Because they use a different architecture, you may want to employ a different combination of algorithms to optimize them. We have found that combining DeepCache, Stable Fast, and QKV fusing works particularly well with Stable Diffusion models.

As the example below shows, this combination accelerates Stable Diffusion XL by 3.6× while keeping the image virtually unchanged.

You can try out this combination by running the following snippet:
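A minimal sketch of that snippet follows; again, the algorithm identifiers ("deepcache", "stable_fast", "qkv_diffusers") are assumptions, so check the Pruna documentation for the exact names.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from pruna import SmashConfig, smash

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Illustrative configuration: DeepCache + Stable Fast + QKV fusing.
# NOTE: the identifiers below are assumptions; see the Pruna docs for the exact names.
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["compiler"] = "stable_fast"
smash_config["factorizer"] = "qkv_diffusers"

smashed_pipe = smash(model=pipe, smash_config=smash_config)

image = smashed_pipe("a watercolor painting of a lighthouse", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```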
Moving Forward
Looking back at the chart from Artificial Analysis: while other providers rely on closed-source optimizations, Pruna’s open-source library lifted us from last place to second, cutting inference latency to 3.8 s. We managed that on an L40S GPU, while most of the other endpoints rely on one (or multiple) faster H100 GPUs. Do you want to claim first place with even more compression algorithms? Check out our Flux-juiced blog post and the inference benchmark.

We are very proud that these state-of-the-art results are possible with our open-source version, and we’re just getting started. In the coming months, we plan to add new algorithms that might unlock even more powerful combinations. If you’re curious, check out our repository to be among the first to try the latest techniques. Stay tuned!

