Case study

Inside Story: The Pruna × Replicate Recipe for the Fastest Inference

Sep 29, 2025

Quentin Sinig

Go-to-Market Lead

John Rachwan

Cofounder & CTO

Pruna helps inference providers run faster, cheaper, and better by delivering unmatched efficiency for their endpoints. Replicate is a platform for hosting and scaling models, where developers can use the latest AI with just an API call. Together, we form the perfect match: Replicate brings the models to millions, and Pruna makes them truly usable at scale. So what’s the story behind us working together? Read on!

March to May 2025: From Crash Tests to Momentum

The first interactions date back to March 2025. At the time, Replicate was curious about our optimization stack and how it could fit into their platform. That curiosity came just as we were preparing to open-source the Pruna package, a piece of perfect timing, especially since transparency is core to their values. Demonstrating our optimization stack without hiding behind proprietary code set the foundation for trust.

The first “crash test” was “Flux Cheetah” (now called “Flux Juiced”). Replicate gave us access to an H100, and we proved that our tech could deliver measurable gains: a 512×512 image in 0.5 seconds. That was the first time we showed them that, yes, our optimizations actually make a difference. Bonus point: we made our findings public!

In April, the HiDream models dropped. Within six hours of release, we shipped an optimized endpoint delivering 1.3× to 2.5× speed-ups. Replicate became the first provider to serve hidream-l1, and our version was featured alongside leading model providers. The traction was immediate: 300,000 runs in the first week, and the total has since grown to 3M runs.

By May, momentum shifted with VACE. Two key things followed from HiDream’s success: we opened a shared Slack channel where engineers from both sides could collaborate directly, and Replicate began testing our optimizations themselves. In parallel, we shipped an optimized VACE endpoint, which was once again featured on their platform. That moment marked the transition from “trying things out” to “let’s actually build together.”

Replicate and Pruna decide to build together

Why Replicate Bet on Pruna’s Optimizations

Replicate has a dedicated Models team, and their mission is clear: get the latest models onto the platform. But in this space, winning isn’t just about getting a model out there; it’s about shipping it fast and making sure it truly performs. A model that runs but is too slow or expensive isn’t usable.

That’s where we fit:

  • Time-to-market made simple. Their platform and our toolkit are both designed as easy-to-use software layers. That synergy made integration straightforward and time-to-market incredibly short.

  • Results on day one. From Flux to Qwen, we repeatedly showed that our optimizations are the best in the market. They directly improved performance (in speed, memory, or both), and, most importantly, did so without quality loss.

  • Continuous improvement. We don’t rest on early wins. Whenever we develop a new optimization technique, we fold it back into the endpoints to push for further speed-ups. We also actively monitor Replicate’s Discord, listening to user feedback and quickly fixing issues when they arise.

This spirit of continuous improvement also extended to the Replicate Discord, where we worked hand-in-hand with their users. From fixing pipeline issues to adding features like “negative prompts,” “LoRAs,” or “higher FPS,” we usually shipped changes within hours. That quick loop became part of the recipe for success: a platform that scales + optimizations that adapt fast.

How Optimizations Go Live on Replicate

Replicate uses Cog, their open-source packaging system for ML inference. Think Docker, but for machine learning. It allows anyone to define how a model is loaded and how it should run (inputs, outputs, dependencies…) and then replicate it across machines.
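
To make that concrete, here is a minimal sketch of what a Cog predictor looks like. It is illustrative only: the model identifier, parameters, and pipeline below are placeholders, not the code behind any of our production endpoints.

```python
# predict.py — minimal Cog predictor sketch (placeholder model, not a production endpoint)
import torch
from cog import BasePredictor, Input, Path
from diffusers import DiffusionPipeline


class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container starts: load the weights onto the GPU.
        self.pipe = DiffusionPipeline.from_pretrained(
            "some-org/some-optimized-model",  # placeholder identifier
            torch_dtype=torch.float16,
        ).to("cuda")

    def predict(
        self,
        prompt: str = Input(description="Text prompt"),
        num_inference_steps: int = Input(description="Denoising steps", default=28),
    ) -> Path:
        # Runs on every API call: generate an image and return it as a file.
        image = self.pipe(prompt, num_inference_steps=num_inference_steps).images[0]
        out = Path("/tmp/output.png")
        image.save(out)
        return out
```

A cog.yaml next to this file declares the Python version, GPU requirement, and dependencies, and a single cog push uploads the packaged model to Replicate.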

We were already “fluent in Cog” since we’d been pushing our own models to Replicate. We didn’t want any walls between our teams, so we built a shared organization on Replicate. Both sides use the same account, which means every new model or update is instantly available. There are no custom pipelines or extra overhead, just a one-push process to get optimized endpoints live.

PS: for other providers, we also built a simple script that sets up the pipeline, downloads the optimized model, loads an input, and generates an output.
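
For the curious, that hand-off script is about as simple as it sounds. Here is a rough sketch of the idea, assuming a diffusers-style pipeline; the repository name, prompt, and settings are placeholders rather than the actual script.

```python
# sanity_check.py — rough sketch of the hand-off script described above
# (repository name, prompt, and output path are illustrative placeholders)
import torch
from diffusers import DiffusionPipeline


def main():
    # 1. Set up the pipeline: download the optimized weights from the hub.
    pipe = DiffusionPipeline.from_pretrained(
        "some-org/some-optimized-model",  # placeholder, not a real repo
        torch_dtype=torch.float16,
    ).to("cuda")

    # 2. Load an input and generate an output.
    prompt = "a cheetah sprinting across a savanna at sunset"
    image = pipe(prompt, num_inference_steps=28).images[0]

    # 3. Save the result so the provider can check quality and latency end to end.
    image.save("output.png")


if __name__ == "__main__":
    main()
```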

Replicate and Pruna work hard together

On top of that, the Replicate Discord community became an unexpected superpower. Users (amplified by Replicate’s Customer Engineering Team) constantly shared feedback: “Can you add negative prompts?” — yes. “Support LoRAs?” — yes. “Improve the FPS?” — yes. Almost always delivered within hours. Whether it’s fixing pipeline issues or adding features, this tight feedback loop keeps the endpoints reliable and evolving. Powering the community with fast iteration is part of our commitment.

August 2025: When Wan 2.2 Changed Everything

By August 2025, we knew Alibaba was preparing to launch Wan 2.2, so we started early, experimenting on Wan 2.1 and treating it as a proxy. Our hunch was correct: 2.2 was more of an adaptation than a complete overhaul, unlike a hypothetical “3.0” release, and that prep work gave us a head start.

  • Day 1: We launched multi-GPU inference with advanced caching, kernel compilation, and tweaks for the dual-transformer architecture. It worked, but it required 8× H100 GPUs per run.

  • Day 2: We additionally applied guidance and step distillation, a method well-suited for flow-matching models (inspired by Black Forest Labs’ work to obtain Flux Schnell). That one change reduced the requirement from 8 GPUs to 1 while maintaining the same speed; the sketch after this list shows what inference with a distilled model typically looks like.
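
To illustrate what guidance and step distillation buys you at inference time, here is a minimal sketch assuming a diffusers-style video pipeline: because guidance is baked into the distilled weights, there is no second (unconditional) forward pass, and the denoising loop runs in a handful of steps instead of dozens. The checkpoint name and exact arguments are placeholders, not our Wan 2.2 endpoint.

```python
# Illustrative only: checkpoint name and settings are placeholders.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# A step/guidance-distilled checkpoint folds the classifier-free guidance signal
# into the weights, so a single forward pass per denoising step is enough.
pipe = DiffusionPipeline.from_pretrained(
    "some-org/wan-video-distilled",  # placeholder identifier
    torch_dtype=torch.bfloat16,
).to("cuda")

output = pipe(
    prompt="a paper boat drifting down a rainy street",
    num_inference_steps=4,   # a few steps instead of the usual 40-50
    guidance_scale=1.0,      # guidance is already distilled into the weights
)

# Most diffusers video pipelines return frames that can be written straight to disk.
export_to_video(output.frames[0], "output.mp4", fps=16)
```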

Suddenly, all endpoints (image-to-video, text-to-video) were running on a single GPU, affordable and fast enough to scale. The shift wasn’t incremental; it was transformative for the creative industry.

  • 100× cheaper than alternatives like Veo, which shocked the community on X.

  • In the first month, usage climbed toward 1 million runs, with 75% concentrated on image-to-video.

  • That flagship endpoint could generate a 480p, 5-second video in 8.7 seconds, making it the fastest, most affordable video model available.

Where We Go From Here

The partnership keeps growing. Whenever a new open-weight model gains traction, we’re ready to optimize and ship it together (looking at you, Wan Animate!). We’re also experimenting with co-model creation, just like we did with Wan Image. A future initiative might even include a new generation of models custom-made to unlock new capabilities and become the production-grade building blocks of entire industries. More on that soon!

We hope you enjoyed the read!

PS: much love to our friends at Replicate, we’re super happy to be building together <3

Curious what Pruna can do for your models?

Whether you're running GenAI in production or exploring what's possible, Pruna makes it easier to move fast and stay efficient.
