Technical Article
Slashing torch.compile Warmup & LoRA Swapping Times with Pruna
Sep 9, 2025

John Rachwan
Cofounder & CTO

Johanna Sommer
ML Research Engineer

Bertrand Charpentier
Cofounder, President & Chief Scientist

Sara Han Díaz
DevRel Engineer

PyTorch introduced torch.compile, a powerful feature that significantly boosts performance by compiling your models. However, it comes with a catch: the first run is very slow. That warmup delay can be a drag on development iteration and lead to slower cold starts in production. If you’ve ever swapped a LoRA or made a small model change, you’ve probably noticed that frustrating pause before things get moving again. But what if you could dramatically reduce, or even eliminate, these warmup delays?
In this post, we'll dive into two practical techniques, powered by Pruna, to mitigate warmup times. We'll show you how to:
Eliminate the initial model warmup when deploying or reloading a model on a new machine (with identical hardware), using Pruna's portable compilation feature.
Achieve zero warmup when switching LoRAs (Low-Rank Adaptations) on an already optimized model.
Get ready to reclaim those precious seconds (or even minutes!) and make your torch.compile experience smoother than ever.
The Challenge: Understanding torch.compile Warmup
Before we dive into the solutions, let's briefly touch upon why torch.compile has a warmup phase. When you first invoke a model compiled with torch.compile, several things happen under the hood. PyTorch needs to:
Capture the computational graph: It traces the execution of your model to understand its structure.
Perform graph optimizations: The captured graph is then optimized for better performance.
Detect and fuse operators: The backend (such as Inductor) identifies which operations can be combined for faster execution.
Generate code: The chosen backend then emits optimized code, often CUDA kernels for GPUs or efficient CPU code.
Compile the code: This generated code is compiled into executable machine instructions.
This entire process, especially the code generation and compilation steps, can take a noticeable amount of time, ranging from seconds to minutes, depending on the model's complexity and the hardware. While this is a one-time cost for a given model shape and hardware (as the compiled artifacts are cached), it can be disruptive:
Start/Stop instances: When a new instance of an application starts (e.g., a serverless function or a new pod in Kubernetes), the first request might experience this long warmup, leading to poor user experience.
Switch instances: If you compile a model on one machine and then try to run it on another (even with identical hardware), the cache might not be directly usable, leading to another full warmup.
Switch model adapters: Swapping LoRAs or other adapters can alter the model graph, triggering recompilation.
Development iteration: Waiting for recompilation after minor code changes or restarting a kernel slows the development cycle.
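To make that cost concrete, here is a minimal timing sketch (our own toy example, not from Pruna) that compares the first call to a torch.compile'd model against subsequent calls. The model, shapes, and numbers of iterations are arbitrary; actual warmup times depend heavily on your model and hardware.

```python
import time
import torch

# A small toy model; real-world warmup grows with model complexity.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

compiled_model = torch.compile(model)  # default Inductor backend
x = torch.randn(8, 1024, device="cuda")

for step in range(3):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        compiled_model(x)
    torch.cuda.synchronize()
    print(f"call {step}: {time.perf_counter() - start:.3f}s")

# The first call includes graph capture, optimization, code generation, and
# compilation; later calls reuse the cached, optimized kernels.
```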
Pruna offers elegant ways to mitigate these issues, as we'll see next.
Use Case 1: Eliminating Initial Warmup with Pruna's Portable Compilation
The Problem
Traditionally, running a compiled model on a new machine triggers a full compilation warmup, even if the hardware is identical. This can slow down processes, especially when deploying models to production or sharing them with others.
The Core Idea
Pruna makes compilation portable. It saves the required artifacts so they can be easily packaged with your model and reused on another machine (with the same hardware architecture and CUDA drivers) without needing to recompile from scratch. That way, the model will run fast right from the first inference.
The Benefits
Faster deployment: Skip the first-run delay when deploying pre-compiled models to production servers, especially serverless instances.
Easier collaboration: Share ready-to-run models with your team.
Smoother pipelines: Speed up CI/CD by avoiding repeated compilation.
How to Use Pruna’s Portable Compilation
Let's walk through how to use this feature:
Load your model as usual: In our example, we use a Stable Diffusion pipeline from Diffusers.
Configure Pruna for Portable Compilation: This is where the magic happens. Create a SmashConfig object and configure torch_compile to be portable.
Smash the Model: Apply the configuration using smash().
Run and Save the Model: Run your model for the first time to trigger the compilation process, including the warmup. After that, just save your Pruna-smashed model, and it’ll be ready to use on any other machine.
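Put together, these steps look roughly like the sketch below. It assumes Pruna's SmashConfig/smash API with the torch_compile compiler and the torch_compile_make_portable flag mentioned later in this post; the model ID, prompt, and save path are placeholders, and exact configuration keys and save helpers may differ between Pruna versions, so check the Pruna documentation.

```python
import torch
from diffusers import StableDiffusionPipeline
from pruna import SmashConfig, smash

# 1. Load your model as usual.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 2. Configure Pruna: compile with torch.compile and make the artifacts portable.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_make_portable"] = True  # the flag discussed later in this post

# 3. Smash the model.
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# 4. Run once to trigger compilation (the one-time warmup), then save.
smashed_pipe("a photo of an astronaut riding a horse").images[0]
smashed_pipe.save_pretrained("sd15-smashed-portable/")

# The saved folder ships the compilation artifacts with the model, so loading it
# on another machine with identical hardware runs fast from the first inference.
```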
Use Case 2: Zero Warmup for LoRA Switching with Diffusers Hotswap and Pruna (torch.compile) Compatibility
The Problem
Low-Rank Adaptation (LoRA) is a game-changer for efficiently fine-tuning large models. It allows for quick adaptation by training only a small set of parameters.
A powerful workflow involves dynamically switching between different LoRAs on a base model to change its output on the fly—for instance, altering image styles in a generative model. However, a challenge arises when you combine it with compilation. Every LoRA swap can look like a graph change—triggering a long recompilation and wiping out the speed advantage.
The Core Idea
While Diffusers handles the mechanics of LoRA hotswapping, using Pruna with torch.compile and leveraging one of its cachers ensures that these Diffusers-driven LoRA swaps are efficient and don't cause recompilation warmups after the initial model compilation.
The Benefits
With Pruna and Diffusers together, you get flexible LoRA adaptation and high-performance execution with no warmup delays.
Instant LoRA swaps: Serve models that adapt to diverse user inputs by loading different LoRAs, or build applications that require rapid switching between LoRA-defined styles or functionalities (e.g., in an image generation UI), without the latency of recompilation.
Efficient experimentation: Test multiple LoRAs quickly without waiting for recompiles.
How to Leverage Diffusers Hotswap with Pruna for Zero Warmup
Let's walk through how this works:
Load the Base Model and Enable Diffusers LoRA Hotswapping.
Configure Pruna: Configure torch.compile and enable a cacher. In this example, we will use the fora cacher, but others also maintain compatibility.
Smash the Model: Apply the configuration using smash().
Run the Model: Run the model for the first time, triggering the torch.compile warmup for the base model and the current LoRA. Then, you’ll be ready to hotswap to a new LoRA.
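A rough end-to-end sketch of this flow is shown below. It assumes the same SmashConfig/smash conventions as in Use Case 1, Diffusers' enable_lora_hotswap and load_lora_weights(..., hotswap=True) API, and uses FLUX.1-dev plus placeholder LoRA repository names for illustration; whether LoRA methods are called on the smashed wrapper or the original pipeline may vary by Pruna version, so treat the details as illustrative.

```python
import torch
from diffusers import FluxPipeline
from pruna import SmashConfig, smash

# 1. Load the base model and enable Diffusers LoRA hotswapping.
#    The LoRA repo IDs below are placeholders; replace them with your own.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.enable_lora_hotswap(target_rank=64)  # reserve capacity for the largest LoRA rank you plan to use
pipe.load_lora_weights("your-org/style-lora-A", adapter_name="style")

# 2. Configure Pruna: torch.compile plus a cacher (here, fora).
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "fora"

# 3. Smash the model.
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# 4. The first run triggers the torch.compile warmup for the base model + current LoRA.
smashed_pipe("an astronaut in style A", num_inference_steps=28).images[0]

# Hotswap to a new LoRA: the weights are swapped in place, so the compiled graph
# is reused and no recompilation warmup occurs. (We assume the smashed pipeline
# forwards Diffusers' LoRA methods to the wrapped pipeline.)
smashed_pipe.load_lora_weights("your-org/style-lora-B", hotswap=True, adapter_name="style")
smashed_pipe("an astronaut in style B", num_inference_steps=28).images[0]
```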
Comparing the Solutions: Portable Compilation vs. Pruna Cacher Compatibility
While we presented these use cases separately, they can easily be combined:
Use portable compilation to create a base smashed model (perhaps with a default LoRA already applied) that loads quickly on new instances.
Once loaded, Pruna's compatibility with hotswapping ensures that any subsequent LoRA hot swaps (managed by Diffusers) on that instance are also free of torch.compile warmup delays.
This combined approach gives you both fast cold starts and fast adapter switching.
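As a hypothetical illustration of the combined flow on a fresh instance (assuming a PrunaModel.from_pretrained-style loader for the saved smashed pipeline; check the Pruna docs for the exact name, and note the path and LoRA repo ID are placeholders):

```python
from pruna import PrunaModel

# Load the portable, pre-compiled pipeline saved earlier (e.g. with a default LoRA
# already applied). On identical hardware the shipped compilation artifacts are
# reused, so there is no cold-start warmup.
smashed_pipe = PrunaModel.from_pretrained("smashed-portable-with-default-lora/")

# The very first request already runs at full speed...
smashed_pipe("a watercolor landscape").images[0]

# ...and later LoRA hot swaps (handled by Diffusers) reuse the compiled graph,
# so they are also free of torch.compile warmup delays.
smashed_pipe.load_lora_weights("your-org/style-lora-B", hotswap=True, adapter_name="style")
smashed_pipe("a watercolor landscape, new style").images[0]
```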
Conclusions: Reclaim Your Time with Pruna
The torch.compile warmup can slow down production workflows for cold starts and adapter switching. Pruna addresses these challenges with two key features:
Portable compilation (torch_compile_make_portable=True) removes first-run warmup when deploying to identical hardware, enabling immediate peak performance.
Diffusers' LoRA hotswapping with torch.compile and a Pruna cacher enables instant LoRA switching without recompilation delays.
For background on PyTorch's compilation and caching mechanisms, you might find the official PyTorch torch.compile Caching Tutorial insightful.
We hope this guide helps you optimize your torch.compile workflows. Happy coding!
Enjoy the Quality and Efficiency!
Want to take it further?
Compress your own models with Pruna and give us a ⭐ to show your support!
Try our Replicate endpoint with just one click.
Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
Join the conversation and stay updated in our Discord community.




