Technical Article

Optimization Agent: Automatically Finding the Best AI Compression!

May 8, 2025

Johanna Sommer

ML Research Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

When it comes to scaling AI efficiently, configuration choices can make or break your model’s performance. From hardware constraints to model-specific requirements, the best approach often shifts depending on your expectations and deployment. At Pruna, we understand these challenges, which is why we’re building tools to help you navigate this complexity with confidence. In this blog, we’ll walk you through how to find the best configuration for any model, using Flux as a concrete example, with our new Optimization Agent feature.

Doing things manually: some of the pain, some of the gain

If you’ve explored our smashing tutorials, you may have encountered the SmashConfig object, our powerful and flexible way to control how models get optimized. After loading the model, the SmashConfig lets you define the compression configuration to apply to it. In this example, it uses the flux_caching cacher:

import torch
from diffusers import FluxPipeline
from pruna_pro import SmashConfig, smash

# Load the base Flux pipeline
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Initialize the SmashConfig and select the caching method
smash_config = SmashConfig()
smash_config["cacher"] = "flux_caching"

# Apply the configuration to the pipeline
smashed_model = smash(model=pipe, smash_config=smash_config)
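
Once smashed, the pipeline can be used like any diffusers pipeline. As a quick sanity check, here is a minimal sketch, assuming the smashed model keeps the standard call signature (the prompt is just an illustrative placeholder):

# Generate an image with the smashed pipeline
image = smashed_model("A photo of a lighthouse at dusk").images[0]
image.save("lighthouse.png")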

While it offers immense control, understanding which compression configuration will actually improve performance can be daunting. That’s exactly why we built the **Optimization Agent**: to take the guesswork out of the process and deliver performance gains tailored to your exact target use case.

No pain and all of the gains with the Optimization Agent

The Optimization Agent can be run in a few lines of code:

from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.task import Task
from pruna_pro import OptimizationAgent

# Define the data and the objectives to optimize for
data_module = PrunaDataModule.from_string("LAION256")
task = Task(["psnr", "elapsed_time"], datamodule=data_module)

# Let the agent search for the best compression configuration
optimization_agent = OptimizationAgent(model=pipe, task=task)
smashed_model = optimization_agent.probabilistic_search(n_trials=10)

Let’s break down what happened here.

Step 1: Define your objectives. In the first part of the snippet, we create a Task object that defines the objectives you would like to optimize. In this example, it uses an **efficiency metric**, elapsed_time, and a **quality metric**, psnr. To ensure these metrics are computed in the right context, we also pass the data module that best represents your data domain. Defining your end objectives removes any ambiguity around which optimization methods to use: you simply state what matters most to you, and the Optimization Agent takes care of the rest!
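
If a full evaluation pass is too slow for quick experiments, the data module can be trimmed before building the Task. A minimal sketch, assuming PrunaDataModule exposes the limit_datasets helper and using an illustrative limit of 10 samples:

from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.task import Task

# Keep only a few samples per split for faster metric computation
data_module = PrunaDataModule.from_string("LAION256")
data_module.limit_datasets(10)

# Pair the objectives with the trimmed data module
task = Task(["psnr", "elapsed_time"], datamodule=data_module)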

Step 2: Let the Optimization Agent work. In the second part of the snippet, we instantiate the Optimization Agent with the base model and the Task we just defined. Depending on how much time you would like the Optimization Agent to spend, you can then use the instant_search method, or the probabilistic_search method with a larger or smaller n_trials, to explore, find, and apply the most performant compression methods (see the sketch below). These search methods launch sophisticated optimization processes designed to discover the ideal balance between speed and model performance. The result? A model pipeline that’s carefully tailored to meet both your efficiency and quality goals, no trade-offs required.
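
For reference, here are the two entry points side by side, a minimal sketch assuming the agent instance from the snippet above and that both methods return the optimized model:

# Fast, single-shot recommendation
smashed_model = optimization_agent.instant_search()

# Deeper search: more trials explore more candidate configurations
smashed_model = optimization_agent.probabilistic_search(n_trials=25)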

So what does the compression configuration found by the Optimization Agent look like?

SmashConfig(
  'cacher': 'taylor_auto',
  'compiler': 'torch_compile',
  'taylor_auto_max_order': 1,
  'taylor_auto_speed_factor': 0.5,
  'torch_compile_backend': 'inductor',
  'torch_compile_dynamic': None,
  'torch_compile_fullgraph': True,
  'torch_compile_mode': 'default',
)

The Optimization Agent concluded that taylor-auto caching paired with the torch compiler offers the best performance; the selected caching and compilation hyperparameters strike an ideal balance for the specified target model, hardware, and metrics.
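
Because the result is a regular SmashConfig, you can also pin the discovered configuration by hand, for example in a deployment script. A sketch that mirrors the printed config above, assuming the same keys are settable through the SmashConfig interface:

from pruna_pro import SmashConfig, smash

# Reproduce the configuration found by the Optimization Agent
smash_config = SmashConfig()
smash_config["cacher"] = "taylor_auto"
smash_config["compiler"] = "torch_compile"
smash_config["taylor_auto_max_order"] = 1
smash_config["taylor_auto_speed_factor"] = 0.5
smash_config["torch_compile_backend"] = "inductor"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_mode"] = "default"
# torch_compile_dynamic is left at its default (None)

smashed_model = smash(model=pipe, smash_config=smash_config)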

When we compare the manually smashed model to the base model, we see a significant improvement in elapsed time at a good PSNR score. The Optimization Agent, however, found a configuration with an even better PSNR, generated images closer to the base model’s, and record-speed inference time!
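
To reproduce such a comparison yourself, the pruna package includes an EvaluationAgent that scores models on the same Task. A minimal sketch, assuming the task, base pipeline, and smashed model defined earlier:

from pruna.evaluation.evaluation_agent import EvaluationAgent

# Score both pipelines on the same metrics and data
eval_agent = EvaluationAgent(task)
base_results = eval_agent.evaluate(pipe)
smashed_results = eval_agent.evaluate(smashed_model)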

With the Optimization Agent, configuring and deploying efficient AI models no longer requires expert-level tuning or time-consuming experimentation. Whether you’re optimizing for speed, memory, quality, or the sweet spot between all three, this intelligent tool transparently figures it out for you.

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.

© 2025 Pruna AI - Built with Pretzels & Croissants 🥨 🥐