Choose the flexibility that suits you: run Pruna self-hosted with Docker, launch it directly from the AWS Marketplace, or deploy it via Koyeb, Replicate, and more.
LoRAs are an extremely convenient tool for extending your model’s capabilities. However, swapping LoRAs can trigger a fresh compilation warmup. Pruna ensures that Diffusers-driven LoRAs stay efficient and don’t cause recompilation warmups.
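Here is a minimal sketch of what this looks like in practice, assuming Pruna's `SmashConfig`/`smash` interface and a standard Diffusers pipeline; the compiler name and the LoRA repository IDs are placeholders, so check the Pruna docs for the exact configuration that applies to your setup.

```python
# Sketch: compile a Diffusers pipeline with Pruna once,
# then swap LoRAs without paying the compilation cost again.
import torch
from diffusers import StableDiffusionXLPipeline
from pruna import SmashConfig, smash

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Configure compilation (the compiler name here is an assumption).
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"

# One-time compilation warmup.
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# Swap LoRAs through the standard Diffusers API; with Pruna this
# should not trigger another recompilation warmup.
smashed_pipe.load_lora_weights("your-org/pixel-art-lora")   # placeholder repo
image = smashed_pipe("a pixel art castle").images[0]

smashed_pipe.unload_lora_weights()
smashed_pipe.load_lora_weights("your-org/watercolor-lora")  # placeholder repo
image = smashed_pipe("a watercolor castle").images[0]
```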
Our solution includes a Compilation Node for optimized execution and three Caching Nodes to reuse computation; the Pruna Nodes are the fastest solution for diffusion models.
We're actively working to bring full vLLM compatibility to Pruna. You can already load Pruna-optimized models in vLLM using supported quantizers such as AutoAWQ, BitsAndBytes, GPTQ, and TorchAO, and our team continues to improve the integration.
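As a rough sketch of how this fits together, the snippet below loads a quantized checkpoint with vLLM's offline API; the model ID is a placeholder for a Pruna-optimized, AWQ-quantized repository.

```python
# Sketch: serve a quantized model with vLLM's offline inference API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model-awq",  # placeholder: Pruna-optimized AWQ checkpoint
    quantization="awq",               # other supported options include "gptq", "bitsandbytes"
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain model quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```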
Curious how? Let’s chat.
Learn more about integrations in our blog articles.