Technical Article · Case study
How Pruna Delivered Up to 3.6x Faster Inference for BRIA 3.2
Jun 26, 2025

Quentin Sinig
Go-to-Market Lead

Nils Fleischmann
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist
Today’s a big day: BRIA just open-sourced BRIA 3.2 weights! If you haven’t tried their model yet, head over to their Hugging Face repo. And it’s the perfect moment for us to finally share the inside story of how we worked with BRIA to optimize the BRIA 3.2 model for inference.
TL;DR
When BRIA first came to us to help optimize their model in early 2025, the goal was clear: make BRIA 3.1 faster without compromising the image quality their creative teams care so much about.
By the time BRIA 3.2 launched publicly at NVIDIA GTC Paris x VivaTech, Pruna had delivered multiple runtime configurations offering 2x to 3.6x faster inference, with quality validated by BRIA’s in-house visual team.
“With many options, Pruna is easy to use and has an amazing customer service!” — Tair Chamiel, Software Engineer at BRIA
March 2025: When we first met BRIA
Our collaboration started with a simple question: “Can you take a look at BRIA 3.1?”
After getting access to the gated model and running some initial tests, we saw clear potential: not just for raw speed, but for clean optimization that preserved BRIA’s visual quality standards. We suggested running a full benchmark to properly evaluate the tradeoffs across multiple compression configurations.
Using BRIA’s custom evaluation dataset, we tested several configurations across 30 and 50 inference steps on 1024×1024 images, including:
Baseline (unoptimized)
torch.compile and Stable Fast
Pruna Juiced (optimized with negligible quality tradeoffs)
Pruna Extra-Juiced (maximum speed, slight quality loss)
All tests were run on L40S GPUs (i.e., AWS g6e.2xlarge instances), the same hardware family BRIA uses in production, so the comparison was apples to apples. For quality, we used LPIPS (which captures perceptual similarity as humans see it), SSIM (which compares image structure and texture), and PSNR (which detects pixel-level distortion): three metrics that together gave a reliable picture of visual fidelity between the base and optimized models.
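If you want to run a similar fidelity check yourself, here is a minimal sketch using torchmetrics. The random tensors are stand-ins for decoded images; BRIA’s evaluation dataset, prompts, and exact metric settings are not reproduced here.

```python
# Minimal fidelity-check sketch with torchmetrics (illustrative stand-in data only).
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Stand-ins for decoded images in [0, 1] with shape (N, 3, H, W);
# the real benchmark compared 1024x1024 outputs from the base and optimized pipelines.
base_images = torch.rand(4, 3, 256, 256)
optimized_images = torch.rand(4, 3, 256, 256)

lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)  # lower = more similar
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)                        # higher = closer structure
psnr = PeakSignalNoiseRatio(data_range=1.0)                                    # higher = less pixel-level distortion

print("LPIPS:", lpips(optimized_images, base_images).item())
print("SSIM :", ssim(optimized_images, base_images).item())
print("PSNR :", psnr(optimized_images, base_images).item())
```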
The result? With Pruna enabled, BRIA 3.1 inference ran ~2.5x faster (30 steps) and passed BRIA’s Creative Team review for image fidelity.

We thought that was the win. But the best part came later.
June 2025: BRIA 3.2 and the GTC moment
Ahead of the BRIA 3.2 public release at NVIDIA GTC x VivaTech 2025, BRIA asked us to re-run and extend the benchmark with the latest models and newer optimization features. Fun fact: this happened 1 week before the event, but we were up for the challenge!
Alongside the two compilers we evaluated in our previous benchmark (torch.compile and Stable Fast), we also examined three quantization schemes: INT8 weight-only quantization, INT8 activation/weight quantization, and FP8 activation/weight quantization. Compared to the optimizations we proposed for BRIA 3.1, the optimized version of BRIA 3.2 is both faster (a 2x speedup instead of 1.5x) and preserves higher image fidelity (an LPIPS of 0.10 instead of 0.14).

Final numbers on BRIA 3.2
✅ Up to 3.6x faster inference
🎯 Multiple runtime profiles (2.0x, 2.6x, 2.9x and 3.6x speedups)
🖼️ Quality preserved (LPIPS as low as 0.10)
🧪 Evaluated on BRIA’s full set of selected evaluation prompts
Sharing our learnings with the community
We thought sharing these extra technical learnings might be useful; it’s part of our mission to build and support a growing AI efficiency community. So for the nerdiest among you (you know who you are!), here are the insights we picked up along the way:

torch.compile also provides benefits like persisting the inductor cache, which can significantly reduce warm-up time. We see no reason to use Stable Fast anymore; torch.compile is our default going forward.
Quantization generally leads to lower fidelity, but that doesn’t always mean lower quality. In our visual inspections, the quantized outputs still looked great.
Weight-only quantization does not offer a significant speedup over other quantization schemes, so we’ve dropped it from consideration.
Both dynamic quantization schemes (INT8 and FP8) provide significant speedups. FP8 is slightly faster, but comes with lower fidelity and, when combined with caching, occasional visual artifacts. So we chose to stick with INT8 activation/weight quantization (see the sketch right after this list).
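To make the last two points concrete, here is a generic sketch of those ingredients: INT8 dynamic activation/weight quantization via torchao, plus torch.compile with a persisted inductor cache. It illustrates the techniques on a toy module, not Pruna’s implementation; the cache path and layer sizes are arbitrary.

```python
# Generic sketch: INT8 dynamic activation/weight quantization + torch.compile
# with a persisted inductor cache. Toy module only; not Pruna's implementation.
import os

# Point the inductor cache at a persistent path so warm-up compilation is reused
# across processes (set before importing torch; any writable directory works).
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/bria_inductor_cache"

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Toy stand-in for a diffusion transformer block; in practice you would quantize
# the pipeline's transformer/UNet module.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).half().cuda()

# INT8 dynamic activation/weight quantization: the tradeoff we settled on.
quantize_(model, int8_dynamic_activation_int8_weight())

# Compile on top; repeated runs reuse the cached inductor artifacts.
model = torch.compile(model, mode="max-autotune")

with torch.inference_mode():
    out = model(torch.randn(8, 4096, device="cuda", dtype=torch.float16))
print(out.shape)
```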
Try it on your own setup
By default, the BRIA weights on Hugging Face do not come with inference optimizations. If you are interested in reproducing these numbers, you can load the model and compress it with Pruna Pro using a snippet like the one below.
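Here is a minimal sketch using the pruna package’s SmashConfig/smash interface (Pruna Pro exposes the same smash interface with additional algorithms). The repo id, loading call, and config keys below are assumptions for illustration; check the BRIA 3.2 model card and the Pruna documentation for the exact names in your version.

```python
# Illustrative sketch, not the exact recipe we shipped for BRIA 3.2.
import torch
from diffusers import DiffusionPipeline  # assumption: BRIA 3.2 loads via diffusers (see the model card)
from pruna import SmashConfig, smash

# Assumption: gated repo id; request access on Hugging Face and log in first.
pipe = DiffusionPipeline.from_pretrained("briaai/BRIA-3.2", torch_dtype=torch.bfloat16).to("cuda")

# Illustrative config keys; see the Pruna docs for the algorithm names available in your version.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["quantizer"] = "torchao"

pipe = smash(model=pipe, smash_config=smash_config)

image = pipe("a red vintage car parked by the sea", num_inference_steps=30).images[0]
image.save("bria_3_2_smashed.png")
```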
If you don’t have a Pro account yet, you can get one here. No commitment, it’s just $0.40/hour per GPU to get started, and we promise: you’ll get the same hands-on support we gave BRIA. 😉
Wrapping up
This collaboration was genuinely great. Again, a shoutout to the BRIA team for their work and commitment to an open ecosystem.
From benchmarking dozens of variants to optimizing for launch-week deadlines, we learned that finding the right combination of compilers, quantizers, and caching takes deep expertise. And that’s exactly what we bring, both through the Pruna package and the research team behind it.
If you're releasing or re-training a foundation model, or launching an app built on open-weight models, and you care about latency and costs, we’re here to help.
👉 Check out the BRIA 3.2 model card
👉 Talk to us or try it yourself with Pruna Open-Source


