Technical Article

Slashing torch.compile Warmup & LoRA Swapping Times with Pruna

Sep 9, 2025

John Rachwan

Cofounder & CTO

Johanna Sommer

ML Research Engineer

Bertrand Charpentier

Cofounder, President & Chief Scientist

Sara Han Díaz

DevRel Engineer

PyTorch introduced torch.compile, a powerful feature that significantly boosts performance by compiling your model into optimized kernels. However, it comes with a catch: the first run is very slow. That warmup delay can be a drag on development iteration and lead to slower cold starts in production. If you’ve ever swapped a LoRA or made a small model change, you’ve probably noticed that frustrating pause before things get moving again. But what if you could dramatically reduce, or even eliminate, these warmup delays?

In this post, we'll dive into two practical techniques, powered by Pruna, to mitigate warmup times. We'll show you how to:

  1. Eliminate the initial model warmup when deploying or reloading a model on a new machine (with identical hardware), using Pruna's portable compilation feature.

  2. Achieve zero warmup when switching LoRAs (Low-Rank Adaptations) on an already optimized model.

Get ready to reclaim those precious seconds (or even minutes!) and make your torch.compile experience smoother than ever.

The Challenge: Understanding torch.compile Warmup

Before we dive into the solutions, let's briefly touch upon why torch.compile has a warmup phase. When you first invoke a model compiled with torch.compile, several things happen under the hood. PyTorch needs to:

  • Capture the computational graph: It traces the execution of your model to understand its structure.

  • Perform graph optimizations: The captured graph is then optimized for better performance.

  • Detect and fuse operators: The backend (such as Inductor) identifies which operations can be combined for faster execution.

  • Generate code: Optimized code (often CUDA kernels for GPUs or efficient CPU code) is generated by the chosen backend (like Inductor).

  • Compile the code: This generated code is compiled into executable machine instructions.

This entire process, especially the code generation and compilation steps, can take a noticeable amount of time, ranging from seconds to minutes, depending on the model's complexity and the hardware. While this is a one-time cost for a given model shape and hardware (as the compiled artifacts are cached), it can be disruptive:

  • Start/Stop instances: When a new instance of an application starts (e.g., a serverless function or a new pod in Kubernetes), the first request might experience this long warmup, leading to poor user experience.

  • Switch instances: If you compile a model on one machine and then try to run it on another (even with identical hardware), the cache might not be directly usable, leading to another full warmup.

  • Switch model adapters: Swapping LoRAs or other adapters can alter the model graph, triggering recompilation.

  • Development Iteration: Waiting for recompilation after minor code changes or restarting a kernel slows the development cycle.
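To get a feel for this cost, you can time the first and second calls of a compiled model yourself. Below is a minimal, self-contained sketch in plain PyTorch; it assumes a CUDA GPU and uses a toy module, so the absolute numbers are small, but the first-call penalty is the same effect that costs minutes on a real diffusion pipeline.

import time
import torch

# A toy model stands in for a real pipeline; the pattern is identical.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to("cuda")

compiled = torch.compile(model)
x = torch.randn(8, 1024, device="cuda")

# First call: graph capture, optimization, code generation, compilation.
start = time.perf_counter()
compiled(x)
torch.cuda.synchronize()
print(f"first (warmup) call: {time.perf_counter() - start:.2f}s")

# Second call: the cached compiled artifacts are reused.
start = time.perf_counter()
compiled(x)
torch.cuda.synchronize()
print(f"second call: {time.perf_counter() - start:.4f}s")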

Pruna offers elegant ways to mitigate these issues, as we'll see next.

Use Case 1: Eliminating Initial Warmup with Pruna's Portable Compilation

The Problem

Traditionally, running a compiled model on a new machine triggers a full compilation warmup, even if the hardware is identical. This can slow down processes, especially when deploying models to production or sharing them with others.

The Core Idea

Pruna makes compilation portable. It saves the required artifacts so they can be easily packaged with your model and reused on another machine (with the same hardware architecture and CUDA drivers) without needing to recompile from scratch. That way, the model will run fast right from the first inference.

The Benefits

  • Faster deployment: Skip the first-run delay when deploying pre-compiled models to production servers, especially serverless instances.

  • Easier collaboration: Share ready-to-run models with your team.

  • Smoother pipelines: Speed up CI/CD by avoiding repeated compilation.

How to Use Pruna’s Portable Compilation

Let's walk through how to use this feature:

  1. Load your model as usual: In our example, we use a Stable Diffusion pipeline from Diffusers.

  2. Configure Pruna for Portable Compilation: This is where the magic happens. Create a SmashConfig object and configure torch_compile to be portable.

  3. Smash the Model: Apply the configuration using smash().

  4. Run and Save the Model: Run your model once to trigger the compilation process, including the warmup. After that, just save your Pruna-smashed model, and it’ll be ready to use on any other machine with matching hardware, as sketched after the code below.

    import torch
    from diffusers import StableDiffusionPipeline
    from pruna import SmashConfig, smash
    
    # Load the model
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16
    ).to("cuda")
    
    # Configure torch.compile and combine it with
    # other Pruna features, such as caching
    smash_config = SmashConfig()
    smash_config["compiler"] = "torch_compile"
    # Now the key setting!
    smash_config["torch_compile_make_portable"] = True
    smash_config["cacher"] = "deepcache"
    
    # Smash the model
    pipe = smash(pipe, smash_config=smash_config)
    
    # Run the model for the first time
    pipe("a photo of an astronaut riding a horse on mars")
    
    # Save the smashed model, including its portable
    # compilation artifacts
    pipe.save_pretrained("smashed_sd_portable_model/")
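On a new machine with the same GPU architecture and CUDA drivers, you can then load the saved model and get compiled-speed inference from the very first call. A minimal sketch, assuming Pruna's PrunaModel.from_pretrained loader and the directory saved above:

from pruna import PrunaModel

# Load the smashed pipeline together with its portable compilation artifacts
pipe = PrunaModel.from_pretrained("smashed_sd_portable_model/")

# No warmup: the first inference already runs at compiled speed
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")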

Use Case 2: Zero Warmup for LoRA Switching with Diffusers Hotswap and Pruna (torch.compile) Compatibility

The Problem

Low-Rank Adaptation (LoRA) is a game-changer for efficiently fine-tuning large models. It allows for quick adaptation by training only a small set of parameters.

A powerful workflow involves dynamically switching between different LoRAs on a base model to change its output on the fly, for instance, altering image styles in a generative model. However, a challenge arises when you combine this workflow with torch.compile: every LoRA swap can look like a graph change, triggering a long recompilation and wiping out the speed advantage.

The Core Idea

While Diffusers handles the mechanics of LoRA hotswapping, using Pruna with torch.compile and leveraging one of its cachers ensures that these Diffusers-driven LoRA swaps are efficient and don't cause recompilation warmups after the initial model compilation.

The Benefits

With Pruna and Diffusers together, you get flexible LoRA adaptation and high-performance execution with no warmup delays.

  • Instant LoRA swaps: Serve models that adapt to diverse user inputs by loading different LoRAs, or power applications that need to switch rapidly between LoRA-defined styles or functionalities (e.g., an image generation UI), without the latency of recompilation.

  • Efficient experimentation: Test multiple LoRAs quickly without waiting for recompiles.

How to Leverage Diffusers Hotswap with Pruna for Zero Warmup

Let's walk through how this works:

  1. Load the Base Model and Enable Diffusers LoRA Hotswapping.

  2. Configure Pruna: Configure torch.compile and enable a cacher. In this example, we use the fora cacher, but other cachers are compatible as well.

  3. Smash the Model: Apply the configuration using smash().

  4. Run the Model: Run the model for the first time, triggering the torch.compile warmup for the base model and the current LoRA. Then, you’ll be ready to hotswap to a new LoRA.

import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# Load the base model and enable LoRA hotswapping
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.enable_lora_hotswap(target_rank=128) # target_rank is an example

# Load an initial LoRA
pipe.load_lora_weights("alvdansen/frosting_lane_flux") # Example LoRA

# Configure Pruna's `torch.compile` and `fora`
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "fora"
smash_config["fora_interval"] = 2
smash_config["fora_start_step"] = 2
smash_config._prepare_saving = False  # `False` for experimentation

# Smash the model
pipe = smash(
    model=pipe,
    smash_config=smash_config,
)

# Run the model for the first time
prompt ="a cat jumping in the air to catch a bird"
generator = torch.Generator("cpu").manual_seed(0)
pipe(prompt, num_inference_steps=28, generator=generator).images[0]
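From here, swapping in a different LoRA is a single call. Diffusers' hotswap path replaces the adapter weights in place, so the compiled graph is reused and no new warmup is triggered. A minimal sketch; the second LoRA id is a placeholder, and we assume the smashed pipeline still exposes Diffusers' load_lora_weights:

# Hotswap a second LoRA into the already-compiled pipeline;
# hotswap=True swaps the adapter weights in place instead of rebuilding the graph
pipe.load_lora_weights("another/flux-lora-of-your-choice", hotswap=True)

# Same prompt, new style, and no recompilation warmup
generator = torch.Generator("cpu").manual_seed(0)
pipe(prompt, num_inference_steps=28, generator=generator).images[0]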

Combining the Solutions: Portable Compilation and Pruna Cacher Compatibility

While we separately presented these use cases, they can be easily combined:

  1. Use portable compilation to create a base smashed model (perhaps with a default LoRA already applied) that loads quickly on new instances.

  2. Once loaded, Pruna's compatibility with hotswapping ensures that any subsequent LoRA hot swaps (managed by Diffusers) on that instance are also free of torch.compile warmup delays.

This combined approach gives you both fast cold starts and instant adapter switching.
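As a rough sketch of what that combination could look like, reusing the settings from the two examples above (the LoRA, prompt, and save path are illustrative):

import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# Base model with a default LoRA and hotswapping enabled
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.enable_lora_hotswap(target_rank=128)
pipe.load_lora_weights("alvdansen/frosting_lane_flux")

# Portable compilation plus a cacher, so the compiled artifacts travel with the model
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_make_portable"] = True
smash_config["cacher"] = "fora"

pipe = smash(pipe, smash_config=smash_config)

# One warmup run, then save; new instances load it warm and can hotswap LoRAs freely
pipe("a cat jumping in the air to catch a bird", num_inference_steps=28)
pipe.save_pretrained("smashed_flux_portable_hotswap/")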

Conclusions: Reclaim Your Time with Pruna

The torch.compile warmup can slow down production workflows, from cold starts to adapter switching. Pruna addresses these challenges with two key features:

  • Portable compilation (torch_compile_make_portable=True) removes first-run warmup when deploying to identical hardware, enabling immediate peak performance.

  • Diffusers' LoRA hotswapping with torch.compile and a Pruna cacher enables instant LoRA switching without recompilation delays.

For background on PyTorch's compilation and caching mechanisms, you might find the official PyTorch torch.compile Caching Tutorial insightful.

We hope this guide helps you optimize your torch.compile workflows. Happy coding!

Enjoy the Quality and Efficiency!

Want to take it further?

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.
