Case study

How Pruna Delivered Up to 3.6x Faster Inference for BRIA 3.2

Jun 26, 2025

Quentin Sinig

Go-to-Market Lead

Nils Fleischmann

ML Research Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

Today’s a big day: BRIA just open-sourced the BRIA 3.2 weights! If you haven’t tried their model yet, head over to their Hugging Face repo. And it’s the perfect moment for us to finally share the inside story of how we worked with BRIA to optimize BRIA 3.2 for inference.

TL;DR

When BRIA first came to us in early 2025 for help optimizing their model, the goal was clear: make BRIA 3.1 faster without compromising the image quality their creative teams care so much about.

By the time BRIA 3.2 launched publicly at NVIDIA GTC Paris x VivaTech, Pruna had delivered multiple runtime configurations offering 2x to 3.6x faster inference, with image quality validated by BRIA’s in-house visual team.

“With many options, Pruna is easy to use and has an amazing customer service!” — Tair Chamiel, Software Engineer at BRIA

March 2025: When we first met BRIA

Our collaboration started with a simple question: “Can you take a look at BRIA 3.1?”

After getting access to the gated model and running some initial tests, we saw clear potential, not just for raw speed but for clean optimization that preserved BRIA’s visual quality standards. We suggested running a full benchmark to properly evaluate the tradeoffs across multiple compression configurations.

Using BRIA’s custom evaluation dataset, we tested several configurations across 30 and 50 inference steps on 1024×1024 images, including:

  • Baseline (unoptimized)

  • torch.compile and Stable Fast

  • Pruna Juiced (optimized with negligible quality tradeoffs)

  • Pruna Extra-Juiced (maximum speed, slight quality loss)

All tests were run on L40S GPUs (i.e. AWS g6e.2xlarge instances), the same hardware family that BRIA uses in production, so results were apples to apples. For quality, we used LPIPS (captures perceptual similarity as humans see it), SSIM (compares image structure and texture), and PSNR (detects pixel-level distortion) — three metrics that together gave a reliable picture of visual fidelity between the base and optimized model.
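To make the comparison concrete, here is a rough sketch of how these three metrics can be computed between a base image and an optimized image using the torchmetrics library. The tensors below are placeholders; BRIA’s evaluation dataset and our full benchmarking harness are not included.

import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# placeholder tensors standing in for decoded images in [0, 1], shape (N, 3, 1024, 1024)
base_img = torch.rand(1, 3, 1024, 1024)
optimized_img = torch.rand(1, 3, 1024, 1024)

lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)  # lower is better
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)                         # higher is better
psnr = PeakSignalNoiseRatio(data_range=1.0)                                     # higher is better

print(f"LPIPS: {lpips(optimized_img, base_img):.3f}")
print(f"SSIM:  {ssim(optimized_img, base_img):.3f}")
print(f"PSNR:  {psnr(optimized_img, base_img):.2f} dB")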

The result? With Pruna enabled, BRIA 3.1 inference ran ~2.5x faster (30 steps) and passed BRIA’s Creative Team review for image fidelity.

We thought that was the win. But the best part came later.

June 2025: BRIA 3.2 and the GTC moment

Ahead of the BRIA 3.2 public release at NVIDIA GTC x VivaTech 2025, BRIA asked us to re-run and extend the benchmark with the latest models and newer optimization features. Fun fact: this happened 1 week before the event, but we were up for the challenge!

Alongside the two compilers we evaluated in our previous benchmark (torch.compile and Stable Fast), we also examined three quantization schemes: INT8 weight-only quantization, and INT8 and FP8 dynamic activation/weight quantization. Compared to the optimizations we proposed for BRIA 3.1, the optimized version of BRIA 3.2 is both faster (2× speedup instead of 1.5×) and preserves higher image fidelity (LPIPS of 0.10 instead of 0.14).
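For intuition, the three families of schemes roughly correspond to the torchao transforms sketched below. This is purely an illustration of what each scheme does to a model’s linear layers, not how Pruna Pro applies them under the hood; the toy model is a stand-in for the BRIA transformer.

import torch
from torchao.quantization import (
    quantize_,
    int8_weight_only,                         # INT8 weight-only
    int8_dynamic_activation_int8_weight,      # INT8 activation/weight
    float8_dynamic_activation_float8_weight,  # FP8 activation/weight (recent GPUs only)
)

# toy stand-in for a transformer block
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048)).to("cuda", torch.bfloat16)

# pick one scheme; quantize_ swaps the linear layers in place
quantize_(model, int8_dynamic_activation_int8_weight())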

Final numbers on BRIA 3.2

✅ Up to 3.6x faster inference

🎯 Multiple runtime profiles (2.0x, 2.6x, 2.9x and 3.6x speedups)

🖼️ Quality preserved (LPIPS as low as 0.10)

🧪 Evaluated across BRIA’s full set of selected evaluation prompts

Sharing our learnings with the community

We thought sharing these extra technical learnings might be useful, part of our mission to build and support a growing AI efficiency community. So for the nerdiest among you (you know who you are!), here are the insights we picked up along the way:

  • torch.compile also provides side benefits like persisting the inductor cache, which can significantly reduce warm-up time (a minimal sketch follows after this list). We see no reason to use Stable Fast anymore; torch.compile is our default going forward.

  • Quantization generally leads to lower fidelity, but that doesn’t always mean lower quality. In our visual inspections, the quantized outputs still looked great.

  • Weight-only quantization does not offer a significant speedup over other quantization schemes, so we’ve dropped it from consideration.

  • Both dynamic quantization schemes (INT8 and FP8) provide significant speedups. FP8 is slightly faster, but comes with lower fidelity and, when combined with caching, occasional visual artifacts. So we chose to stick with INT8 activation/weight quantization.
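As a minimal sketch of the warm-up point above: the inductor cache can be persisted across processes by pointing it at a fixed directory before importing torch. The directory path here is illustrative.

import os

# write inductor artifacts to a fixed directory instead of a per-process temp dir
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/var/cache/inductor"  # illustrative path
# cache compiled FX graphs so identical graphs are not recompiled on restart
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

import torch  # import after setting the env vars so inductor picks them up

compiled = torch.compile(torch.nn.Linear(8, 8))
compiled(torch.randn(2, 8))  # first call compiles and fills the cache; later runs reuse it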

Try it on your own setup

By default, the BRIA 3.2 weights on Hugging Face do not come with inference optimizations. However, if you are interested in reproducing these numbers, you can do so by loading the model and compressing it with the snippet below, using Pruna Pro. :)

import torch
from huggingface_hub import hf_hub_download
from pruna_pro import SmashConfig, smash

# download the BRIA-3.2 pipeline files into the working directory
# so that `from pipeline_bria import BriaPipeline` below resolves them
repo_id = "briaai/BRIA-3.2"
hf_hub_download(repo_id=repo_id, filename="pipeline_bria.py", local_dir=".")
hf_hub_download(repo_id=repo_id, filename="transformer_bria.py", local_dir=".")
hf_hub_download(repo_id=repo_id, filename="bria_utils.py", local_dir=".")

from pipeline_bria import BriaPipeline

pipe = BriaPipeline.from_pretrained(
    "briaai/BRIA-3.2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device="cuda")

# smash BRIA-3.2 with Pruna Pro
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
smash_config["cacher"] = "auto"
smash_config["auto_cache_mode"] = "taylor"
smash_config["auto_speed_factor"] = 0.7  # 0.5 for even faster inference
smash_config._prepare_saving = False
pipe = smash(pipe, smash_config, experimental=True)

# run inference with smashed pipe
prompt = "A portrait of a Beautiful and playful ethereal singer, golden designs, highly detailed, blurry background"
negative_prompt = "Logo,Watermark,Ugly,Morbid,Extra fingers,Poorly drawn hands,Mutation,Blurry,Extra limbs,Gross proportions,Missing arms,Mutated hands,Long neck,Duplicate,Mutilated,Mutilated hands,Poorly drawn face,Deformed,Bad anatomy,Cloned face,Malformed limbs,Missing legs,Too many fingers"
image = pipe(prompt=prompt, negative_prompt=negative_prompt, height=1024, width=1024).images[0]
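# the pipeline returns PIL images, so the result can be saved directly (illustrative filename)
image.save("bria_3_2_smashed.png")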

If you don’t have a Pro account yet, you can get one here. No commitment, it’s just $0.40/hour per GPU to get started, and we promise: you’ll get the same hands-on support we gave BRIA. 😉

Wrapping up

This collaboration was genuinely great. Again, a shoutout to the BRIA team for their work and commitment to an open ecosystem.

From benchmarking dozens of variants to optimizing for launch-week deadlines, we learned that finding the right combination of compilers, quantizers, and caching takes deep expertise. And that’s exactly what we bring, both through the Pruna package and the research team behind it.

If you're releasing or re-training a foundation model, or launching an app built on open-weight models, and you care about latency and costs, we’re here to help.

👉 Check out the BRIA 3.2 model card

👉 Talk to us or try it yourself with Pruna Open-Source


Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.
