Open-Source Library

Make your AI Model Efficient with Pruna

Learn about optimization, optimize your own model with multiple SOTA algorithms, and deploy it on any platform.

pip install pruna

Pruna Package

Get faster inference without the trial-and-error process.

Combine 50+ state-of-the-art compression algorithms (including pruning, quantization, caching, and more!) to make your own optimized models.

Benchmark your AI model's efficiency and quality.
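
Getting started takes only a few lines. Below is a minimal sketch of the smash workflow using the SmashConfig/smash API from the Pruna docs; the pipeline, model id, and the "deepcache"/"torch_compile" algorithm choices are illustrative, not the only options.

import torch
from diffusers import DiffusionPipeline
from pruna import SmashConfig, smash

# Any supported pipeline works; this model id is just an example.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"        # reuse intermediate diffusion features
smash_config["compiler"] = "torch_compile"  # compile for the target hardware

smashed_pipe = smash(model=pipe, smash_config=smash_config)
image = smashed_pipe("a cinematic photo of a lighthouse at dawn").images[0]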

Pruna Models

Run 10K+ AI-optimized models in free access on Hugging Face, covering image, video, and text.

Make Your Own AI Model

Faster, Smaller, Cheaper, Greener!

[Demo: Pruna AI optimizing image and video generation models]

By using Pruna OSS, you gain access to the most advanced optimization engine, capable of smashing any AI model with the latest compression methods for unmatched performance.

Flux Kontext

Janus-Pro-7B

Flux Dev

Learn about all families of compression methods!

Pruning

Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.
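
As an illustrative sketch, here is magnitude pruning with PyTorch's built-in utilities (not Pruna's own pruners, shown only to make the idea concrete):

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(256, 256)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~30% of connections removed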

Quantization

Quantization reduces the precision of the model’s weights and activations, shrinking the memory they require.
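
An illustrative sketch with PyTorch's post-training dynamic quantization (independent of Pruna's quantizers):

import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

# Replace fp32 Linear weights with int8, roughly a 4x memory reduction.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)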

Batching

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.
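
A minimal illustration of the idea in plain PyTorch:

import torch

model = torch.nn.Linear(128, 10)
inputs = [torch.randn(128) for _ in range(32)]

# Unbatched: 32 separate forward passes.
outputs_slow = [model(x) for x in inputs]

# Batched: one (32, 128) tensor processed in a single pass.
outputs_fast = model(torch.stack(inputs))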

Enhancing

Enhancers improve the quality of the model’s output. They range from post-processing to test-time compute algorithms.
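
As one hypothetical test-time compute example, best-of-N sampling generates several candidates and keeps the highest-scoring one; generate and score below are placeholders for your model and quality metric:

def best_of_n(prompt, generate, score, n=4):
    # Generate several candidates and return the one the metric prefers.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)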

Caching

Caching stores the intermediate results of computations so that subsequent operations can reuse them, which is particularly useful for cutting inference time in machine learning models.
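
A toy illustration of the principle; diffusion-specific cachers (e.g. DeepCache) apply the same idea to intermediate features across denoising steps:

from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_embedding(text: str) -> tuple:
    # Stand-in for a costly forward pass.
    return tuple(ord(c) * 0.01 for c in text)

expensive_embedding("hello")  # computed
expensive_embedding("hello")  # served from the cache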

Recovery

Recovery restores the performance of a model after compression.
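
Typically this means a brief fine-tune on the original task; a sketch with placeholder model and data:

import torch

model = torch.nn.Linear(128, 10)  # stand-in for a pruned/quantized model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for _ in range(100):  # a few recovery steps
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()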

Factorization

Factorization batches several small matrix multiplications into one large fused operation, which, while neutral on memory and raw latency, unlocks notable speed-ups when used alongside quantization.
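
An illustrative sketch fusing three projections (say Q, K, V) into one matmul:

import torch

d = 256
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d, bias=False) for _ in range(3))

# One (3d, d) weight replaces three (d, d) ones: a single fused kernel launch.
fused = torch.nn.Linear(d, 3 * d, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight]))

x = torch.randn(8, d)
q, k, v = fused(x).chunk(3, dim=-1)  # one matmul, three outputs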

Distillation

Distillation trains a smaller, simpler model to mimic a larger, more complex model.
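
The classic recipe trains the student on the teacher's softened outputs; an illustrative loss:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Match the teacher's temperature-softened distribution (Hinton et al.).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T

teacher_logits = torch.randn(32, 10)  # from the large, frozen model
student_logits = torch.randn(32, 10, requires_grad=True)
distillation_loss(student_logits, teacher_logits).backward()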

Compilation

Compilation optimizes the model for specific hardware.
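
For example, with PyTorch's torch.compile (just one of several compiler backends):

import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
compiled_model = torch.compile(model)  # traces and emits hardware-tuned kernels

x = torch.randn(16, 512)
y = compiled_model(x)  # first call compiles; later calls hit the fast path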

Distributers

Distributers distribute the model or certain calculations across multiple devices, improving computational efficiency and reducing overall processing time.
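
A simple illustration with PyTorch's built-in data parallelism:

import torch

model = torch.nn.Linear(512, 512)
if torch.cuda.device_count() > 1:
    # Replicate the model on each GPU; every replica processes a batch slice.
    model = torch.nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")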

Pruna Course

Learn how to compress, evaluate, and deploy efficient AI models from theory to practice.

Pruna Materials

Stay up to date with the most recent AI optimization literature.